Abstract: Consider the standard Gaussian linear regression model $Y = \mathbf{X}\theta_0 + \epsilon$, where
$Y \in \mathbb{R}^n$ is a response vector and $\mathbf{X} \in \mathbb{R}^{n \times p}$ is a design matrix. Numerous works have
been devoted to building efficient estimators of $\theta_0$ when $p$ is much larger than $n$. In such
a situation, a classical approach amounts to assuming that $\theta_0$ is approximately sparse. This
paper studies the minimax risks of estimation and testing over classes of $k$-sparse vectors $\theta_0$.
These bounds shed light on the limitations due to high-dimensionality. The results encompass
the problem of prediction (estimation of $\mathbf{X}\theta_0$), the inverse problem (estimation of $\theta_0$) and
linear testing (testing $\mathbf{X}\theta_0 = 0$). Interestingly, an elbow effect occurs when the number of
variables $k\log(p/k)$ becomes large compared to $n$. Indeed, the minimax risks and hypothesis
separation distances blow up in this ultra-high dimensional setting. We also prove that even
dimension reduction techniques cannot provide satisfying results in an ultra-high dimensional
setting. Moreover, we compute the minimax risks when the variance of the noise is unknown.
The knowledge of this variance is shown to play a significant role in the optimal rates of
estimation and testing. All these minimax bounds provide a characterization of statistical
problems that are so difficult that no procedure can provide satisfying results.
… regression, high-dimensional geometry, minimax risk.

1. Introduction 

In many important statistical applications, including remote sensing, functional MRI and gene
expression studies, the number $p$ of parameters is much larger than the number $n$ of observations.
An active line of research aims at developing computationally fast procedures that also achieve the
best possible statistical performances in this "$p$ larger than $n$" setting. A typical example is the
study of $\ell_1$-based penalization methods for the estimation of linear regression models. However, if
$p$ is really too large compared to $n$, all these new procedures fail to achieve a good estimation.

Thus, there is a need to understand the intrinsic limitations of a statistical problem: what is 
the best rate of estimation or testing achievable by a procedure? Is it possible to design good 
procedures for arbitrarily large $p$, or are there theoretical limitations when $p$ becomes "too large"?
These limitations tell us what kinds of data analysis problems are so complex that no statistical
procedure is able to provide reasonable results. Furthermore, the knowledge of such limitations
may drive the research towards areas where computationally efficient procedures are shown to be 
suboptimal. 

1.1. Linear regression and statistical problems 

We observe a response vector $Y \in \mathbb{R}^n$ and a real design matrix $\mathbf{X}$ of size $n \times p$. Consider the linear
regression model

$$Y = \mathbf{X}\theta_0 + \epsilon,$$    (1.1)

where the vector $\theta_0$ of size $p$ is unknown and the random vector $\epsilon$ follows a centered normal
distribution $\mathcal{N}(0_n, \sigma^2 I_n)$. Here, $0_n$ stands for the null vector of size $n$ and $I_n$ for the identity matrix of size $n$.

In some cases, the design $\mathbf{X}$ is considered as fixed, either because it has been previously chosen
or because we work conditionally on the design. In other cases, the rows of the design matrix $\mathbf{X}$
correspond to an $n$-sample of a random vector $X$ of size $p$. The design $\mathbf{X}$ is then said to be random.
A specific class of random designs is made of Gaussian designs, where $X$ follows a centered normal
distribution $\mathcal{N}(0_p, \Sigma)$. The analyses of fixed and Gaussian designs share many common points. In
this work, we shall highlight the similarities and the differences between both settings.

There are various statistical problems arising in the linear regression model (1.1). Let us list the 
most classical issues: 

(P1): Linear hypothesis testing. In general, the aim is to test whether $\theta_0$ belongs to a linear
subspace of $\mathbb{R}^p$. Here, we focus on testing the null hypothesis $H_0$: $\{\theta_0 = 0_p\}$. In Gaussian design,
this is equivalent to testing whether $Y$ is independent of $\mathbf{X}$.

(P2): Prediction. We focus on predicting the expectation $\mathbb{E}[Y]$ in fixed design and the conditional
expectation $\mathbb{E}[Y|\mathbf{X}]$ in Gaussian design.

(P3): Inverse problem. The primary interest lies in estimating $\theta_0$ itself and the corresponding
loss function is $\|\hat\theta - \theta_0\|_p^2$, where $\|\cdot\|_p$ is the $\ell_2$ norm in $\mathbb{R}^p$.

(P4): Support estimation aims at recovering the support of $\theta_0$, that is the set of indices
corresponding to non-zero coefficients. The easier problem of dimension reduction amounts
to estimating a set $\hat{M} \subset \{1,\ldots,p\}$ of "reasonable" size that contains the support of $\theta_0$ with high
probability.

Much work has been devoted to these statistical questions in the so-called high-dimensional
setting, where the number of covariates $p$ is possibly much larger than $n$. A classical approach to
perform a statistical analysis in this setting is to assume that $\theta_0$ is sparse, in the sense that most of
the components of $\theta_0$ are equal to 0. For the problem of prediction (P2), procedures based on
complexity penalization are proved to provide good risk bounds for known variance [11] and unknown
variance [6] but are computationally inefficient. In contrast, convex penalization methods such as
the Lasso or the Dantzig selector are fast to compute, but only provide good performances under
restrictive assumptions on the design $\mathbf{X}$ (e.g. [8, 13, 50]). Exponential weighted aggregation
methods [18, 40] are another example of fast and efficient methods. The $\ell_1$ penalization methods have
also been analyzed for the inverse problem (P3) [8] and for support estimation (P4) [36, 49].
Dimension reduction methods are often studied in more general settings than linear regression [17, 26]. In
the linear regression model, the SIS method [25], based on the correlation between the response and
the covariates, performs dimension reduction. The problem of high-dimensional hypothesis
testing (P1) has so far attracted less attention. Some testing procedures are discussed in [7, 3] for
fixed design and in [44, 34] for Gaussian design.

1.2. Sparsity and ultra-high dimensionality 

Given a positive integer $k$, we say that the vector $\theta_0$ is $k$-sparse if $\theta_0$ contains at most $k$
non-zero components. We call $k$ the sparsity parameter. In this paper, we are interested in the setting
$k < n < p$. We note $\Theta[k,p]$ the set of $k$-sparse vectors in $\mathbb{R}^p$.

In linear regression, most of the results about classical procedures require that the triplet $(k,n,p)$
satisfies $k[1 + \log(p/k)] < n$. When $k$ is "small", this corresponds to assuming that $p$ is
sub-exponential with respect to $n$. The analyses of the Lasso in prediction, inverse problems [8], and support
estimation [38] entail such assumptions. In dimension reduction, the SIS method [25] also requires
this assumption. Although the multiple testing procedure of [7] can be analyzed for $k[1 + \log(p/k)]$ larger
than $n$, it exhibits a much slower rate of testing in this case. In noiseless problems ($\sigma = 0$),
compressed sensing methods [23] fail when $k[1 + \log(p/k)]$ is large compared to $n$ (see [22] for numerical
illustrations). In the sequel, we say that the problem is ultra-high dimensional$^1$ when $k[1 + \log(p/k)]$
is large compared to $n$. Observe that ultra-high dimensionality does not necessarily imply that $p$ is
exponential with respect to $n$. As an example, taking $p = n^3$ and $k = n/\log\log(n)$ asymptotically
yields an ultra-high dimensional problem.
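
The example above can be checked numerically. The following Python sketch evaluates the ratio $k[1 + \log(p/k)]/n$ for $p = n^3$ and $k = n/\log\log(n)$. The function name `dimension_ratio` is hypothetical, and rounding $k$ down to an integer is an assumption made only for this illustration.

```python
import math


def dimension_ratio(n: int) -> float:
    """Return k[1 + log(p/k)] / n for p = n^3 and k = floor(n / loglog(n)).

    Assumption: k is rounded down to an integer (and kept >= 1).
    """
    p = n ** 3
    k = max(1, math.floor(n / math.log(math.log(n))))
    return k * (1.0 + math.log(p / k)) / n


if __name__ == "__main__":
    # The ratio keeps growing with n, although p is only polynomial in n.
    for n in (10**2, 10**4, 10**6):
        print(n, round(dimension_ratio(n), 2))
```

The ratio increases slowly (roughly like $\log(n)/\log\log(n)$), which is why the statement is only asymptotic.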

Why should we care about ultra-high dimensional problems? In this setting, there are so many
variables that statistical questions such as the estimation of $\theta_0$ (P3) or its support (P4) are likely to
be difficult. Nevertheless, if the signal-to-noise ratio is large, do there exist estimators that perform
relatively well? The answer is no. We prove in this paper that a phase transition phenomenon occurs
in an ultra-high dimensional setting and that most of the estimation and testing problems become
hopeless. This phase transition phenomenon implies that some statistical problems that are tackled
in postgenomics or functional MRI cannot actually be addressed properly.

Example 1.1 (Motivating example). In some gene network inference problems (e.g. [16]), the
number $p$ of genes can be as large as 5000 while the number $n$ of microarray experiments is only of
order 50. Let us consider a gene $A$. We note $Q_A$ the set of genes that interact with the gene $A$ and
$k$ the cardinality of $Q_A$. How large can $k$ be so that it is still "reasonable" to estimate
$Q_A$ from the microarray experiments? In statistical terms, inferring the set of genes interacting
with $A$ amounts to estimating the support of a vector $\theta_0$ in a linear regression model (see e.g. [38]).
Our answer is that if $k$ is larger than 4, then the problem of network estimation becomes extremely
difficult. We will come back to this example and explain this answer in Section 7.

1.3. Minimax risks 

A classical way to assess the performance of an estimator $\hat\theta$ is to consider its maximal risk over a
class $\Theta \subset \mathbb{R}^p$. This is the minimax point of view. For the time being, we only define the notions of
minimaxity for estimation problems (P2 and P3). Their counterparts in the case of testing (P1) and
dimension reduction (P4) will be introduced in subsequent sections. Given a loss function $l(\cdot,\cdot)$ and an
estimator $\hat\theta$, the maximal risk of $\hat\theta$ over $\Theta[k,p]$ for a design $\mathbf{X}$ (or a covariance $\Sigma$ in the Gaussian
design case) and a variance $\sigma^2$ is defined by $\sup_{\theta_0 \in \Theta[k,p]} \mathbb{E}_{\theta_0,\sigma}[l(\hat\theta, \theta_0)]$. Taking the infimum of the
maximal risk over all possible estimators $\hat\theta$, we obtain the minimax risk

$$\inf_{\hat\theta} \sup_{\theta_0 \in \Theta[k,p]} \mathbb{E}_{\theta_0,\sigma}[l(\hat\theta, \theta_0)].$$

We say that an estimator is minimax if its maximal risk over $\Theta[k,p]$ is close to the minimax risk.

In practice, we do not know the number $k$ of non-zero components of $\theta_0$ and we seldom know
the variance $\sigma^2$ of the error. If an estimator does not require the knowledge of $k$ and nearly
achieves the minimax risk over $\Theta[k,p]$ for a range of $k$, we say that it is adaptive to the sparsity.
Similarly, an estimator is adaptive to the variance $\sigma^2$ if it does not require the knowledge of $\sigma^2$
and nearly achieves the minimax risk for all $\sigma^2 > 0$. When possible, the main challenge is to build
adaptive procedures. In some statistical problems considered here, adaptation is in fact impossible 
and there is an unavoidable loss when the variance or the sparsity parameter is unknown. In such 
situations, it is interesting to quantify this unavoidable loss. 

1.4. Our contribution and related work

In the specific case of the Gaussian sequence model, where $n = p$ and $\mathbf{X} = I_n$, the minimax
risks over $k$-sparse vectors have been studied for a long time. Donoho and Johnstone [21, 35] have
provided the asymptotic minimax risks of prediction (P2). Baraud [5] has studied the optimal
rate of testing from a non-asymptotic point of view while Ingster [31, 32, 33] has provided the
asymptotic optimal rate of testing with exact constants.

$^1$ […] $O(n^{\beta})$ with $\beta < 1$. We argue in this paper that as soon as $k\log(p)/n$ goes to 0, the case $\log(p) = O(n^{\beta})$ is not
intrinsically more difficult than conditions such as $p = O(n^{\delta})$ with $\delta > 0$.

Recently, some high-dimensional problems have been studied from a minimax point of view.
Wainwright [45, 46] provides minimax lower bounds for the problem of support estimation (P4).
Raskutti et al. [39] and Rigollet and Tsybakov [40] have provided minimax upper bounds and lower
bounds for (P2) and (P3) over $\ell_q$ balls for general fixed designs $\mathbf{X}$ when the variance $\sigma^2$ is known
(see also Ye and Zhang [47] and Abramovich and Grinshtein [1]). Arias-Castro et al. [3] and Ingster
et al. [34] have computed the asymptotic minimax detection boundaries for the testing problem
(P1) for some specific designs. However, their study only encompasses reasonable dimensional
problems ($p$ grows polynomially with $n$). Some minimax lower bounds have also been stated for
testing (P1) and prediction (P2) problems with Gaussian design [42, 44]. None of the aforementioned
results cover the ultra-high dimensional case or tackle the problem of adaptation
to both $k$ and $\sigma$.

This paper provides minimax lower bounds and upper bounds for the problems (P1), (P2), (P3)
when the regression vector $\theta_0$ is $k$-sparse for fixed and random designs, known and unknown
variance, known and unknown sparsities. The lower and upper bounds match up to possible differences
in the logarithmic terms. The main discoveries are the following:

1. Phase transition in an ultra-high dimensional setting. Contrary to previous work,
our results cover both the high-dimensional and ultra-high dimensional settings. We establish
that for each of the problems (P1), (P2) and (P3), an elbow effect occurs when $k\log(p/k)$
becomes large compared to $n$. Let us emphasize the difference between the high-dimensional
and the ultra-high dimensional regimes for two problems: prediction (P2) and support
estimation (P4).

Prediction with random design. In the (non-ultra) high-dimensional setting, the minimax risk
of prediction for a random design regression is of order $\sigma^2 k\log(p/k)/n$ (see Section 3). Thus,
the effect of the sparsity $k$ is linear and the effect of the number of variables $p$ is logarithmic.
In an ultra-high dimensional setting, that is when $k\log(p/k)/n$ is large, we establish that an
elbow effect occurs in the minimax risk. In this setting, the minimax risk becomes of order
$\sigma^2 \exp[Ck\{1 + \log(p/k)\}/n]$, where $C$ is a positive constant: it grows exponentially fast with
$k$ and polynomially with $p$ (see the red curve in Figure 1). While it was expected that the
minimax risk cannot be small for such problems, we prove here that the minimax risk is in fact
exponentially larger than the usual $k\log(p/k)/n$ term.
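
To give a feeling for the size of this elbow effect, the following Python sketch evaluates the two orders of magnitude quoted above. It is only an illustration: the constant $C$ is not specified in the text, so `C = 1.0` and `sigma2 = 1.0` are assumptions, and the function names are hypothetical.

```python
import math


def high_dim_order(k: int, p: int, n: int, sigma2: float = 1.0) -> float:
    """Order sigma^2 * k * log(p/k) / n of the (non-ultra) high-dimensional regime."""
    return sigma2 * k * math.log(p / k) / n


def ultra_high_dim_order(k: int, p: int, n: int,
                         sigma2: float = 1.0, C: float = 1.0) -> float:
    """Order sigma^2 * exp[C k {1 + log(p/k)} / n]; C = 1 is an assumed value."""
    return sigma2 * math.exp(C * k * (1.0 + math.log(p / k)) / n)


if __name__ == "__main__":
    n, p = 50, 5000  # sizes borrowed from Example 1.1
    for k in (1, 5, 10, 20, 40):
        print(k, high_dim_order(k, p, n), ultra_high_dim_order(k, p, n))
```

With these assumed constants, the gap between the two expressions widens quickly as $k$ grows, which is the qualitative behaviour described in the text.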

Support estimation. In a non-ultra high-dimensional setting, it is known [46] that under some
assumptions on the design $\mathbf{X}$ (e.g. the components of $\mathbf{X}$ are i.i.d. standard normal
variables) the support of a $k$-sparse vector $\theta_0$ is recoverable with high probability if

$$\forall i \in \mathrm{supp}(\theta_0), \quad |(\theta_0)_i| \ge C\sqrt{\log(p)/n}\,\sigma,$$    (1.2)

where $C$ is a numerical constant. In an ultra-high dimensional setting, even if

$$\forall i \in \mathrm{supp}(\theta_0), \quad |(\theta_0)_i| = \exp[Ck\{1 + \log(p/k)\}/n]/\sqrt{k}\,\sigma,$$    (1.3)

it is not possible to estimate the support of $\theta_0$ with high probability. Observe that the
condition (1.3) is much stronger than (1.2). In fact, it is not even possible to reduce drastically
the dimension of the problem without forgetting relevant variables with positive probability.
More precisely, for any dimension reduction procedure that selects a subset of variables $\hat{M} \subset
\{1,\ldots,p\}$ of size $p^{\delta}$ with some $0 < \delta < 1$ (described in Proposition 6.7), we have $\mathrm{supp}(\theta_0) \not\subset \hat{M}$
with probability bounded away from zero (see Proposition 6.7). Thus, it is almost hopeless to have
a reliable estimation of the support of $\theta_0$ even if $\|\theta_0\|_p^2/\sigma^2$ is large. This impossibility of
dimension reduction for ultra-high dimensional problems is numerically illustrated in Section
7.
2. Adaptation to the sparsity $k$ and to the variance $\sigma^2$. Most theoretical results for the
problems (P1) and (P2) require that the variance $\sigma^2$ is known. Here, we establish these
minimax bounds for both known and unknown variance and known and unknown sparsity.
The knowledge of the variance is proved to play a fundamental role for the testing problem
(P1) when $k[1 + \log(p/k)]$ is large compared to $\sqrt{n}$. The knowledge of $\sigma^2$ is also proved to be
crucial for (P2) in an ultra-high dimensional setting. Thus, specific work is needed to develop
fast and efficient procedures that do not require the knowledge of the variance. Furthermore,
variance estimation is extremely difficult in an ultra-high dimensional setting.

3. Effect of the design. Lastly, the minimax bounds of (P1), (P2) and (P3) are established
for fixed and Gaussian designs. Except for the problem of prediction (P2), the minimax risks
are shown to be of the same nature for both forms of the design. Furthermore, we investigate
the dependency of the minimax risks on the design $\mathbf{X}$ (resp. $\Sigma$) in Sections 4-6.

The minimax bounds stated in this paper are non-asymptotic. While some upper bounds are
consequences of recent results in the literature, most of the effort is spent here on deriving the lower
bounds. These bounds rely on Fano's and Le Cam's methods [48] and on geometric considerations.
In each case, near-optimal procedures are exhibited.

1.5. Organization of the paper

In Section 3, we summarize the minimax bounds for specific designs called "worst-case" and
"best-case" designs in order to emphasize the effects of dimensionality. The general results are stated
in Section 4 for the tests and Section 5 for the problem of prediction. The problems of inverse
estimation, support estimation, and dimension reduction are studied in Section 6. In Section 7,
we address the following practical question: for exactly what range of $(k,p,n)$ should we consider
a statistical problem as ultra-high dimensional? A small simulation study illustrates this answer.
Section 8 contains the final discussion and side results about variance estimation. Section 9 is
devoted to the proofs of the main minimax lower bounds. Specific statistical procedures are used to
establish the minimax upper bounds. Most of these procedures serve as theoretical tools but
should not be applied in a high-dimensional setting because they are computationally inefficient.
In order to clarify the statements of the results in Sections 4-6, we postpone the definition of these
procedures to Section 10. The remaining proofs are described in a technical appendix [43].

2. Notations and preliminaries 

We respectively note $\|\cdot\|_n$ and $\|\cdot\|_p$ the $\ell_2$ norms in $\mathbb{R}^n$ and $\mathbb{R}^p$, while $\langle\cdot,\cdot\rangle_n$ refers to the inner product
in $\mathbb{R}^n$. For any $\theta_0 \in \mathbb{R}^p$ and $\sigma > 0$, $\mathbb{P}_{\theta_0,\sigma}$ and $\mathbb{E}_{\theta_0,\sigma}$ refer to the joint distribution of $(Y, \mathbf{X})$. When
there is no risk of confusion, we simply write $\mathbb{P}$ and $\mathbb{E}$. All references with a capital letter such as
Section A or Eq. (A.3) refer to the technical Appendix [43].

In the sequel, we note $\mathrm{supp}(\theta_0)$ the support of $\theta_0$. For any $1 \le k \le p$, $\mathcal{M}(k,p)$ stands for the
collection of all subsets of $\{1,\ldots,p\}$ with cardinality $k$. Given $i \in \{1,\ldots,p\}$, we note $\mathbf{X}_i$ the vector of
size $n$ corresponding to the $i$-th column of $\mathbf{X}$. For $m \subset \{1,\ldots,p\}$, $\mathbf{X}_m$ stands for the $n \times |m|$ submatrix
of $\mathbf{X}$ that contains the columns $\mathbf{X}_i$, $i \in m$. In what follows, we note $\mathbf{X}^T$ the transposed matrix of $\mathbf{X}$.

Gaussian design and conditional distribution. When the design is said to be "Gaussian",
the $n$ rows of $\mathbf{X}$ are $n$ independent samples of a random row vector $X$ such that $X^T \sim \mathcal{N}(0_p, \Sigma)$.
Thus, $(Y, \mathbf{X})$ is an $n$-sample of the random vector $(Y, X^T) \in \mathbb{R}^{p+1}$, where the scalar $Y$ is defined by

$$Y = X\theta_0 + \epsilon,$$    (2.1)

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The linear regression model with Gaussian design is relevant to understand
the conditional distribution of a Gaussian variable $Y$ conditionally on a Gaussian vector since
$\mathbb{E}[Y|X] = X\theta_0$ and $\mathrm{Var}(Y|X) = \sigma^2$. This is why we shall often refer to $\sigma^2$ as the conditional
variance of $Y$ when considering Gaussian design. This model is also closely connected to the estimation
of Gaussian graphical models [38, 44].

As explained later, the minimax risk over $\Theta[k,p]$ strongly depends on the design $\mathbf{X}$. This is why
we introduce some relevant quantities on $\mathbf{X}$.

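Definition 2.1. Consider some integer $k > 0$ and some design $\mathbf{X}$.

$$\Phi_{k,+}(\mathbf{X}) := \sup_{\theta \in \Theta[k,p] \setminus \{0_p\}} \frac{\|\mathbf{X}\theta\|_n^2}{\|\theta\|_p^2} \quad \text{and} \quad \Phi_{k,-}(\mathbf{X}) := \inf_{\theta \in \Theta[k,p] \setminus \{0_p\}} \frac{\|\mathbf{X}\theta\|_n^2}{\|\theta\|_p^2}.$$    (2.2)

In fact, $\Phi_{k,+}(\mathbf{X})$ and $\Phi_{k,-}(\mathbf{X})$ respectively correspond to the largest and the smallest restricted
eigenvalue of order $k$ of $\mathbf{X}^T\mathbf{X}$.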
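
As an illustration of Definition 2.1, the following Python sketch computes the restricted eigenvalues of order $k = 2$ by brute force, using the closed-form eigenvalues of each $2 \times 2$ Gram matrix $\mathbf{X}_m^T\mathbf{X}_m$. It assumes the ratio $\|\mathbf{X}\theta\|_n^2/\|\theta\|_p^2$ without any further normalization; the function name `restricted_eigenvalues_k2` and the toy designs are hypothetical. Restricting to supports of size exactly 2 is enough when $p \ge 2$, since enlarging a support can only increase the largest eigenvalue and decrease the smallest one.

```python
import itertools
import math


def restricted_eigenvalues_k2(X):
    """Brute-force restricted eigenvalues of order 2 for a design X given as a list of rows.

    Returns (largest, smallest) over all pairs of columns (requires p >= 2).
    """
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    largest, smallest = -math.inf, math.inf
    for i, j in itertools.combinations(range(p), 2):
        # Entries of the 2x2 Gram matrix [[a, b], [b, c]] of columns i and j.
        a, b, c = dot(cols[i], cols[i]), dot(cols[i], cols[j]), dot(cols[j], cols[j])
        mid, rad = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
        largest = max(largest, mid + rad)   # largest eigenvalue of this Gram matrix
        smallest = min(smallest, mid - rad)  # smallest eigenvalue of this Gram matrix
    return largest, smallest


if __name__ == "__main__":
    orthogonal = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    duplicated = [[1, 1, 0], [0, 0, 1]]  # the first two columns are equal
    print(restricted_eigenvalues_k2(orthogonal))
    print(restricted_eigenvalues_k2(duplicated))
```

For an orthonormal design both quantities coincide, whereas two identical columns drive the smallest restricted eigenvalue to zero. The enumeration over all supports is exponential in $k$ in general, so this is a didactic sketch rather than a practical procedure.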

Given a symmetric real square matrix $A$, $\varphi_{\max}(A)$ stands for the largest eigenvalue of $A$. Finally,
$C, C_1, \ldots$ denote positive universal constants that may vary from line to line. The notation $C(\cdot)$
specifies the dependency on some quantities.

In the propositions, the constants involved in the assumptions are not always expressly specified.
For instance, sentences of the form "Assume that $n \ge C$. Then, ..." mean that "There exists a
universal constant $C > 0$ such that if $n \ge C$, then ...".



3. Main results 

The exact bounds are stated in Sections 4-6. In order to explain these results, we now summarize
the main minimax bounds by focusing on the role of $(k,n,p)$ rather than on the dependency on
the design $\mathbf{X}$. In order to keep the notation short, we do not provide in this section the minimal
assumptions of the results. Let us simply mention that all of them are valid if the sparsity $k$ satisfies
$k \le p^{1/3} \wedge (n/5)$ and if $p \ge n \ge C$, where $C$ is a positive numerical constant.



3.1. Prediction

3.1.1. Definitions 

First, the results are described for the problem of prediction (P2) since the problem of minimax
estimation is more classical in this setting. Different prediction loss functions are used for fixed and
Gaussian designs. When the design is considered as fixed, we study the loss $\|\mathbf{X}(\theta_1 - \theta_2)\|_n^2/(n\sigma^2)$.
For Gaussian design, we consider the integrated prediction loss function:

$$\|\sqrt{\Sigma}(\theta_1 - \theta_2)\|_p^2/\sigma^2 = \mathbb{E}\left[\{X(\theta_1 - \theta_2)\}^2\right]/\sigma^2.$$    (3.1)

Given a design $\mathbf{X}$, the minimax risk of prediction over $\Theta[k,p]$ with respect to $\mathbf{X}$ is

$$\mathcal{R}_F[k,\mathbf{X}] := \inf_{\hat\theta} \sup_{\theta_0 \in \Theta[k,p]} \mathbb{E}_{\theta_0,\sigma}\left[\|\mathbf{X}(\hat\theta - \theta_0)\|_n^2/(n\sigma^2)\right].$$    (3.2)

For a Gaussian design with covariance $\Sigma$, we study the quantity

$$\mathcal{R}_R[k,\Sigma] := \inf_{\hat\theta} \sup_{\theta_0 \in \Theta[k,p]} \mathbb{E}_{\theta_0,\sigma}\left[\|\sqrt{\Sigma}(\hat\theta - \theta_0)\|_p^2/\sigma^2\right].$$    (3.3)

These minimax risks of prediction do not only depend on $(k,n,p)$ but also on the design $\mathbf{X}$ (or on
the covariance $\Sigma$). The computation of the exact dependency of the minimax risks on $\mathbf{X}$ or $\Sigma$ is
a challenging question. To simplify the presentation in this section, we only describe the minimax
prediction risks for worst-case designs defined by

$$\mathcal{R}_F[k] := \sup_{\mathbf{X}} \mathcal{R}_F[k,\mathbf{X}], \qquad \mathcal{R}_R[k] := \sup_{\Sigma} \mathcal{R}_R[k,\Sigma],$$    (3.4)

the supremum being taken over all designs $\mathbf{X}$ of size $n \times p$ (resp. all covariance matrices $\Sigma$). The
quantity $\mathcal{R}_F[k]$ corresponds to the smallest risk achievable uniformly over $\Theta[k,p]$ and all designs $\mathbf{X}$.
It is shown in Section 5 that the quantity $\mathcal{R}_R[k]$ is achieved (up to constants) for a covariance $\Sigma = I_p$
while the quantity $\mathcal{R}_F[k]$ is achieved with high probability for designs $\mathbf{X}$ that are realizations of
the standard Gaussian design (all the components of $\mathbf{X}$ are drawn independently from a standard
normal distribution). This corresponds to designs used in compressed sensing [23]. In fact, the
maximal risks $\mathcal{R}_F[k]$ and $\mathcal{R}_R[k]$ for the prediction problem correspond to typical situations where
the design is well-balanced, that is as close as possible to orthogonality.

3.1.2. Results 

In the sequel, we say that $\mathcal{R}_F[k]$ is of order $f(k,p,n,C)$, where $C$ is a positive constant, when there
exist two positive universal constants $C_1$ and $C_2$ such that

$$f(k,p,n,C_1) \le \mathcal{R}_F[k] \le f(k,p,n,C_2).$$

These minimax risks are computed in Section 5 and are gathered in Table 1. They are also
depicted in Figure 1.




[Figure 1: horizontal axis $k$; region label "Ultra-high dimension".]

Figure 1. Minimax prediction risk (P2) over $\Theta[k,p]$ as a function of $k$ for fixed and random design and known
and unknown variance. The corresponding bounds are stated in Section 5.



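Design                                  Order of the minimax prediction risk
Fixed design: $\mathcal{R}_F[k]$        $C\left[\frac{k}{n}\log\left(\frac{p}{k}\right)\right] \wedge 1$
Gaussian design: $\mathcal{R}_R[k]$     $C_1 \frac{k}{n}\log\left(\frac{p}{k}\right) \exp\left[C_2 \frac{k}{n}\log\left(\frac{p}{k}\right)\right]$

Table 1: Orders of magnitude of the minimax risks of prediction $\mathcal{R}_F[k]$ and $\mathcal{R}_R[k]$ over $\Theta[k,p]$.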

When $k\log(p/k)$ remains small compared to $n$, the minimax risk of prediction is of the same
order for fixed and Gaussian design. The $k\log(p/k)/n$ risk is classical and has been known for
a long time in the specific case of the Gaussian sequence model [35]. Some procedures based on
complexity penalization or aggregation (e.g. [11]) are proved to achieve these risks uniformly over
all designs $\mathbf{X}$. Computationally efficient procedures like the Lasso or the Dantzig selector are only
proved to achieve a $k\log(p)/n$ risk under assumptions on the design $\mathbf{X}$ [8]. If the support of $\theta_0$ is
known in advance, the parametric risk is of order $k/n$. Thus, the price to pay for not knowing the
support of $\theta_0$ is only logarithmic in $p$.

In an ultra-high dimensional setting, the minimax prediction risk in fixed design remains smaller
than one. It is the minimax risk of estimation of the vector $\mathbb{E}(Y)$ of size $n$. This means that the
sparsity index $k$ no longer plays a role in ultra-high dimension. For a Gaussian design,
the minimax prediction risk becomes of order $C_1 (p/k)^{C_2 k/n}$: it increases exponentially fast with
respect to $k$ and polynomially fast with respect to $p$. Comparing this risk with the parametric rate
$k/n$, we observe that the price to pay for not knowing the support of $\theta_0$ is now far higher than $\log(p)$.

In Section 5, we also study the adaptation to the sparsity index $k$ and to the variance $\sigma^2$. We
prove that adaptation to $k$ and $\sigma^2$ is possible for a Gaussian design. In fixed design, no procedure
can be simultaneously adaptive to the sparsity $k$ and the variance $\sigma^2$ (see the red curve in Figure
1 that corresponds to fixed design, $\sigma$ and $k$ unknown).

3.2. Testing 

3.2.1. Definitions 

Let us turn to the problem (P1) of testing $H_0$: $\{\theta_0 = 0_p\}$ against $H_1$: $\{\theta_0 \in \Theta[k,p] \setminus \{0_p\}\}$. We
fix a level $\alpha > 0$ and a type II error probability $\delta > 0$. Minimax lower and upper bounds for this
problem are discussed in Section 4.

Suppose we are given a test procedure $\Phi_\alpha$ of level $\alpha$ for fixed design $\mathbf{X}$ and known variance $\sigma^2$.
The $\delta$-separation distance of $\Phi_\alpha$ over $\Theta[k,p]$, noted $\rho_F[\Phi_\alpha,k,\mathbf{X}]$, is the minimal number $\rho$ such
that $\Phi_\alpha$ rejects $H_0$ with probability larger than $1 - \delta$ if $\|\mathbf{X}\theta_0\|_n/\sqrt{n} \ge \rho\sigma$. Hence, $\rho_F[\Phi_\alpha,k,\mathbf{X}]$
corresponds to the minimal distance such that the hypotheses $\{\theta_0 = 0_p\}$ and $\{\theta_0 \in \Theta[k,p]$,
$\|\mathbf{X}\theta_0\|_n^2 \ge n\rho_F^2[\Phi_\alpha,k,\mathbf{X}]\sigma^2\}$ are well separated by the test $\Phi_\alpha$:

$$\rho_F[\Phi_\alpha,k,\mathbf{X}] := \inf\left\{\rho > 0,\ \inf_{\theta_0 \in \Theta[k,p],\ \|\mathbf{X}\theta_0\|_n \ge \sqrt{n}\rho\sigma} \mathbb{P}_{\theta_0,\sigma}[\Phi_\alpha = 1] \ge 1 - \delta\right\}.$$

Although the separation distance also depends on $\delta$, $n$, and $p$, we only write $\rho_F[\Phi_\alpha,k,\mathbf{X}]$ for the
sake of conciseness. By definition, the test $\Phi_\alpha$ has a power larger than $1 - \delta$ for $\theta_0 \in \Theta[k,p]$ such
that $\|\mathbf{X}\theta_0\|_n^2 \ge n\rho_F^2[\Phi_\alpha,k,\mathbf{X}]\sigma^2$. Then, we consider

$$\rho_F^*[k,\mathbf{X}] := \inf_{\Phi_\alpha} \rho_F[\Phi_\alpha,k,\mathbf{X}].$$    (3.5)

The infimum runs over all level-$\alpha$ tests. We call this quantity the $(\alpha,\delta)$-minimax separation distance
over $\Theta[k,p]$ with design $\mathbf{X}$ and variance $\sigma^2$. The minimax separation distance is a non-asymptotic
counterpart of the detection boundaries studied in the Gaussian sequence model [20].

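Similarly, we define the $(\alpha,\delta)$-minimax separation distance over $\Theta[k,p]$ with Gaussian design by
replacing the distance $\|\mathbf{X}\theta_0\|_n/\sqrt{n}$ by the distance $\|\sqrt{\Sigma}\theta_0\|_p$:

$$\rho_R[\Phi_\alpha,k,\Sigma] := \inf\left\{\rho > 0,\ \inf_{\theta_0 \in \Theta[k,p],\ \|\sqrt{\Sigma}\theta_0\|_p \ge \rho\sigma} \mathbb{P}_{\theta_0,\sigma}[\Phi_\alpha = 1] \ge 1 - \delta\right\}, \qquad \rho_R^*[k,\Sigma] := \inf_{\Phi_\alpha} \rho_R[\Phi_\alpha,k,\Sigma].$$    (3.6)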
Various bounds on $\rho_F^*[k,\mathbf{X}]$ and $\rho_R^*[k,\Sigma]$ are stated in Section 4. In this section, we only provide the
orders of magnitude of the minimax separation distances in the "worst case" designs in order to
emphasize the effect of dimensionality:

$$\rho_F^*[k] := \sup_{\mathbf{X}} \rho_F^*[k,\mathbf{X}], \qquad \rho_R^*[k] := \sup_{\Sigma} \rho_R^*[k,\Sigma].$$    (3.7)

This is the smallest separation distance that can be achieved by a procedure $\Phi_\alpha$ uniformly over all
designs $\mathbf{X}$ (resp. $\Sigma$). As for the prediction problem, it will be proved in Section 4 that the quantities
$\rho_F^*[k]$ and $\rho_R^*[k]$ are achieved for well-balanced designs.

It is not always possible to achieve the minimax separation distances with a procedure $\Phi_\alpha$
that does not require the knowledge of the variance $\sigma^2$. This is why we also consider $\rho_{F,U}^*[k]$
and $\rho_{R,U}^*[k]$, the minimax separation distances for fixed and Gaussian design when the variance is
unknown. Roughly, $\rho_{F,U}^*[k]$ corresponds to the minimal distance $\rho^2$ that allows one to separate well
the hypotheses $\{\theta_0 = 0_p$ and $\sigma > 0\}$ and $\{\theta_0 \in \Theta[k,p]$ and $\sigma > 0$, $\|\mathbf{X}\theta_0\|_n^2/\sigma^2 \ge n\rho^2\}$ when $\sigma$ is
unknown. We shall provide a formal definition at the beginning of Section 4.



3.2.2. Results 

In Table 2, we provide the orders of the minimax separation distances over $\Theta[k,p]$ for fixed and
Gaussian designs, known and unknown variance (see also Figure 2).




[Figure 2: region labels "$k\log(p)$ large compared to ..." and "Ultra-high dimension".]

Figure 2. Orders of magnitude of the minimax separation distances $(\rho_F^*[k])^2$, $(\rho_R^*[k])^2$, $(\rho_{F,U}^*[k])^2$ and $(\rho_{R,U}^*[k])^2$
over $\Theta[k,p]$ (P1) for fixed and random designs and known and unknown variances. Here, $\rho_F^*[k]$ and $\rho_R^*[k]$ behave
similarly while $\rho_{F,U}^*[k]$ and $\rho_{R,U}^*[k]$ behave similarly. The corresponding bounds are stated in Section 4.



Fixed and Gaussian design

Known $\sigma^2$: $(\rho_F^*[k])^2$ and $(\rho_R^*[k])^2$               $C(\alpha,\delta)\,\frac{k\log(p)}{n} \wedge \frac{1}{\sqrt{n}}$
Unknown $\sigma^2$: $(\rho_{F,U}^*[k])^2$ and $(\rho_{R,U}^*[k])^2$     $C(\alpha,\delta)\,\frac{k\log(p)}{n} \exp\left[C_2(\alpha,\delta)\frac{k\log(p)}{n}\right]$

Table 2: Order of the minimax separation distances over $\Theta[k,p]$ for fixed and Gaussian design,
known and unknown variance: $(\rho_F^*[k])^2$, $(\rho_R^*[k])^2$, $(\rho_{F,U}^*[k])^2$, and $(\rho_{R,U}^*[k])^2$.
In contrast to (P2), the minimax separation distances are of the same order for fixed and
Gaussian design.

1. When $k\log(p) \le \sqrt{n}$, all the minimax separation distances are of order $k\log(p)/n$. This
quantity also corresponds to the minimax risk of prediction (P2) stated in the previous
subsection. This separation distance has already been proved in the specific case of the
Gaussian sequence model [5, 20].

2. When $k\log(p) \ge \sqrt{n}$, the minimax separation distances are different under known and
unknown variance. If the variance is known, the minimax separation distance over $\Theta[k,p]$ stays
of order $1/\sqrt{n}$. Here, $1/\sqrt{n}$ corresponds in fixed design to the minimax separation distance of
the hypothesis $\{\mathbb{E}[Y] = 0_n\}$ against the general hypothesis $\{\mathbb{E}[Y] \ne 0_n\}$ for known variance
(see Baraud [5]).

3. If the variance is unknown, the minimax separation distance over $\Theta[k,p]$ is still of order
$k\log(p)/n$ if $k\log(p)$ is small compared to $n$. In contrast, the minimax separation distance
blows up to the order $C_1 p^{C_2 k/n}$ in an ultra-high dimensional setting. This blow-up phenomenon
has also been observed in the previous section for the problem of prediction (P2) in Gaussian
design. In conclusion, the knowledge of the variance is of great importance for $k\log(p)$ larger
than $\sqrt{n}$.
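
The three regimes above can be compared numerically with the following Python sketch. It is a rough illustration only: all constants are set to 1, and the unknown-variance order is modelled by the interpolating form $x\exp(C_2 x)$ with $x = k\log(p)/n$, which is an assumption chosen to match both the $k\log(p)/n$ regime and the $p^{C_2 k/n}$ blow-up up to constants. The function names are hypothetical.

```python
import math


def separation_order_known(k: int, p: int, n: int) -> float:
    """Known variance: order min(k log(p)/n, 1/sqrt(n)), with constants set to 1."""
    return min(k * math.log(p) / n, 1.0 / math.sqrt(n))


def separation_order_unknown(k: int, p: int, n: int, C2: float = 1.0) -> float:
    """Unknown variance: assumed form x * exp(C2 * x) with x = k log(p)/n."""
    x = k * math.log(p) / n
    return x * math.exp(C2 * x)


if __name__ == "__main__":
    n, p = 10_000, 100_000
    for k in (1, 10, 100, 1000, 2000):
        print(k, separation_order_known(k, p, n), separation_order_unknown(k, p, n))
```

For small $k$ the two orders nearly coincide, while for large $k$ the known-variance order saturates at $1/\sqrt{n}$ and the unknown-variance order grows exponentially in $k\log(p)/n$.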

3.3. Inverse problem and support estimation 

3.3.1. Definitions 

In the inverse problem (P3), we are primarily interested in the estimation of $\theta_0$ rather than $\mathbf{X}\theta_0$.
This is why the loss function under study is $\|\theta_1 - \theta_2\|_p^2$. Minimax lower and upper bounds for this
loss function are discussed in Section 6. For a fixed design $\mathbf{X}$, the minimax risk of estimation is

$$\mathcal{RI}_F[k,\mathbf{X}] := \inf_{\hat\theta} \sup_{\theta_0 \in \Theta[k,p]} \mathbb{E}_{\theta_0,\sigma}\left[\|\theta_0 - \hat\theta\|_p^2/\sigma^2\right].$$    (3.8)

If one transforms the design $\mathbf{X}$ by a homothety of factor $\lambda > 0$, then this multiplies the minimax
risk for the inverse problem by a factor $1/\lambda^2$. For the sake of simplicity, we restrict ourselves to
designs $\mathbf{X}$ such that each column has been normed to $\sqrt{n}$. The collection of such designs is noted
$\mathcal{D}_{n,p}$. The supremum of the minimax risks over the designs in $\mathcal{D}_{n,p}$ is $+\infty$. Take for instance a design
where the first two columns are equal. In this section, we only present the infimum of the minimax
risks over $\Theta[k,p]$ as $\mathbf{X}$ varies across $\mathcal{D}_{n,p}$:

$$\mathcal{RI}_F[k] := \inf_{\mathbf{X} \in \mathcal{D}_{n,p}} \mathcal{RI}_F[k,\mathbf{X}].$$

The quantity $\mathcal{RI}_F[k]$ is interpreted the following way: given $(k,n,p)$, what is the smallest risk we
can hope for if we use the best possible design? Alternatively, given $n$ observations, what is the
intrinsic difficulty of estimating a $k$-sparse vector of size $p$? We call this quantity the minimax risk for
the inverse problem over $\Theta[k,p]$.

In Section 6, we also study the corresponding minimax risks of the inverse problem in the 
random design case. Let $S_p$ stand for the set of covariance matrices that contain only ones on the 
diagonal. We respectively define the minimax risk of estimation over $\Theta[k,p]$ for a covariance $\Sigma$ and 
the minimax risk of estimation over $\Theta[k,p]$ as 

$$ \mathcal{RI}_R[k,\Sigma] := \inf_{\hat\theta} \sup_{\theta_0 \in \Theta[k,p]} E_{\theta_0,\sigma}\left[\|\theta_0 - \hat\theta\|_p^2/\sigma^2\right] \quad \text{and} \quad \mathcal{RI}_R[k] := \inf_{\Sigma \in S_p} \mathcal{RI}_R[k,\Sigma] . \qquad (3.9) $$

3.3.2. Results 

In Table 3, we provide the minimax risks in fixed design for different values of (k,n,p) (see also 
Figure 3). 



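| $(k,n,p)$                         | $k\log(p) \le Cn$                           | $k\log(p) \ge n\log(n)$                                      |
| Minimax risk $\mathcal{RI}_F[k]$  | $C\frac{k}{n}\log\left(\frac{p}{k}\right)$  | $\exp\left[C\frac{k}{n}\log\left(\frac{p}{k}\right)\right]$  |

Table 3: Order of the minimax risks $\mathcal{RI}_F[k]$ for the inverse problem over $\Theta[k,p]$ 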




Figure 3. Order of magnitude of the minimax risk $\mathcal{RI}_F[k]$ for the inverse problem (P3) over $\Theta[k,p]$ as a function 
of $k$. The corresponding bounds are stated in Section 6. 

If $k\log(p/k)$ remains smaller than $n$, it is possible to recover the risk $Ck\log(p/k)$ for "good" 
designs. This risk is for instance achieved by the Dantzig selector of Candes and Tao [15] for 
nearly-orthogonal designs, which roughly means that the restricted eigenvalues $\Phi_{3k,+}(X)$ and $\Phi_{3k,-}(X)$ of 
$X^T X$ are close to one. In an ultra-high dimensional setting, it is no longer possible to build 
nearly-orthogonal designs $X$ and the minimax risk of the inverse problem blows up as for testing 
problems (P1) or prediction problems in Gaussian design (P2). Moreover, adaptation to the 
sparsity $k$ and to the variance $\sigma^2$ is possible for the inverse problem. As explained in Section 6, the 
quantities $\mathcal{RI}_R[k,\Sigma]$ and $\mathcal{RI}_R[k]$ behave somewhat similarly to their fixed design counterparts. 

In Section 6, we also discuss the consequences of the minimax bounds on the problem of support 
estimation (P4). We prove that, in an ultra-high dimensional setting, it is not possible to estimate 
with high probability the support of $\theta_0$ unless the ratio $\|\theta_0\|_p^2/\sigma^2$ is larger than $C_1 (p/k)^{C_2 k/n}$. 
In fact, even the problem of support estimation is almost hopeless in an ultra-high dimensional 
setting. 

4. Hypothesis Testing 

We start with the testing problem (P1) because some minimax lower bounds in prediction and inverse 
estimation derive from testing considerations. 

4.1. Known variance 

4.1.1. Gaussian design 

As mentioned in the introduction, the knowledge of $\sigma^2 = \mathrm{Var}(Y|X)$ is really unlikely in many 
practical applications. Nevertheless, we study this case to highlight the differences between known 
and unknown conditional variances. Furthermore, these results turn out to be useful for analyzing 
the minimax separation distances in fixed design problems. We recall that the notions of minimax 
separation distances $\rho^*_F[k,X]$, $\rho^*_F[k]$, $\rho^*_R[k,\Sigma]$, and $\rho^*_R[k]$ have been defined in Section 3.2. 

Theorem 4.1. Assume that $\alpha + \delta < 53\%$, $p \ge n^2$, and that $n \ge 8\log(2/\delta)$. For any $1 \le k \le n$, 
the $(\alpha,\delta)$-minimax separation distance (3.6) with covariance $I_p$ is lower bounded by 

$$ \left(\rho^*_R[k,I_p]\right)^2 \ge C\left[\frac{k}{n}\log(p) \wedge \frac{1}{\sqrt{n}}\right] . \qquad (4.1) $$

For any $1 \le k \le p$ and any covariance $\Sigma$, we have 

$$ \left(\rho^*_R[k,\Sigma]\right)^2 \le C(\alpha,\delta)\left[\frac{k}{n}\log(p) \wedge \frac{1}{\sqrt{n}}\right] . \qquad (4.2) $$

Furthermore, this upper bound is simultaneously achieved for all $k$ and $\Sigma$ by a procedure $T^*$ (defined 
in Section 10.1.1). 

Remark 4.1. [Adaptation to sparsity] It follows from Theorem 4.1 that adaptation to the 
sparsity is possible and that the optimal separation distance is of order 

$$ \frac{k}{n}\log(p) \wedge \frac{1}{\sqrt{n}} , \qquad (4.3) $$

for all sparsities $k$ between 1 and $n$. 

Remark 4.2. [Correlated design] The upper bound (4.2) is valid for any covariance matrix $\Sigma$. 
In contrast, the minimax lower bound (4.1) is restricted to the case $\Sigma = I_p$. This implies that there 
exists some constant $C(\alpha,\delta)$ such that 

$$ \rho^*_R[k,I_p] \ge C(\alpha,\delta) \sup_{\Sigma} \rho^*_R[k,\Sigma] = C(\alpha,\delta)\,\rho^*_R[k] . $$

In other words, the testing problem is more complex (up to constants) for an independent design 
than for a correlated design. 

Remark 4.3. [Which logarithmic term in the bound: $\log(p)$ or $\log(p/k)$?] In the proof of 
Theorem 4.1, we derive the following bounds 



c 



log 1 



P_ 

k 2 



< 



C(a,5) 



los 



These two bounds are of the order of (4.3) as it is assumed that $p \ge n^2$. However, the dependency of the 
logarithmic terms on $k$ in the last bounds does not allow us to provide the minimax separation distance 
when $p = n$ and $k$ is close to $\sqrt{n}$. For instance, if $p = n$ and $k = \sqrt{n}/\log(n)$, the two bounds 
only match up to a factor $\log(n)/\log\log(n)$. The non-asymptotic minimax bounds of Baraud [5] 
in the Gaussian sequence model suffer from the same weakness. To our knowledge, the dependency on 
$\log(k)$ of the minimax separation distances has only been captured in an asymptotic setting [3, 34] 
($(k,p,n) \to \infty$). 

4.1.2. Fixed design 

The separation distances are similar to the Gaussian design case. 

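Theorem 4.2. Assume that $\alpha + \delta < 33\%$, $p \ge n^2 \ge C(\alpha,\delta)$, and that $n \ge 8\log(2/\delta)$. For any 
$1 \le k \le n$, there exist some $n \times p$ designs $X$ such that 

$$ \left(\rho^*_F[k,X]\right)^2 \ge C\left[\frac{k}{n}\log(p) \wedge \frac{1}{\sqrt{n}}\right] . \qquad (4.4) $$

For any $1 \le k \le p$ and any design $X$, we have 

$$ \left(\rho^*_F[k,X]\right)^2 \le C(\alpha,\delta)\left[\frac{k}{n}\log(p) \wedge \frac{1}{\sqrt{n}}\right] . \qquad (4.5) $$

Furthermore, this upper bound is simultaneously achieved for all $k$ and $X$ by a procedure $T^*$ (defined 
in Section 10.1.1). 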

As for the random design case, we conclude that adaptation to the sparsity is possible and that 
$(\rho^*_F[k])^2$ is of order $\frac{k}{n}\log(p) \wedge \frac{1}{\sqrt{n}}$. In fact, the proof shows that, with large probability, designs $X$ 
whose components are independently sampled from a standard normal variable satisfy (4.4). 

Arias-Castro et al. [3] and Ingster et al. [34] have recently provided the asymptotic minimax 
separation distance with exact constant for known variance when the design satisfies very specific 
conditions. Theorem 4.2 provides the non-asymptotic counterpart of their result, but the constants 
in (4.4) and (4.5) are not optimal. 



4.2. Unknown variance 

4.2.1. Preliminaries 

We now turn to the study of the minimax separation distances when the variance $\sigma^2$ is unknown. In 
Section 3.2, we have introduced the notions of $\delta$-separation distances and $(\alpha,\delta)$-minimax separation 
distances when the variance $\sigma^2$ is known. We now define their counterparts for an unknown variance $\sigma^2$. 

Let us consider a test $\Phi_\alpha$ of the hypothesis $H_0$ for the linear regression model with fixed design 
$X$. We say that $\Phi_\alpha$ has level $\alpha$ under unknown variance if 

$$ \sup_{\sigma > 0} P_{0_p,\sigma}\left[\Phi_\alpha(Y,X) > 0\right] \le \alpha . $$

This means that the type I error probability is controlled uniformly over all variances $\sigma^2$. Similarly, 
we want to control the type II error probabilities uniformly over all variances. The $\delta$-separation 
distance $\rho_{F,U}[\Phi_\alpha,k,X]$ of $\Phi_\alpha$ over $\Theta[k,p]$ for unknown variance is defined by 

$$ \rho_{F,U}[\Phi_\alpha,k,X] := \inf\left\{\rho > 0, \ \inf_{\sigma > 0,\ \theta_0 \in \Theta[k,p],\ \|X\theta_0\|_n \ge \sqrt{n}\rho\sigma} P_{\theta_0,\sigma}\left[\Phi_\alpha = 1\right] \ge 1 - \delta\right\} . \qquad (4.6) $$

Hence, $\rho_{F,U}[\Phi_\alpha,k,X]$ corresponds to the minimal distance such that the hypotheses $\{\theta_0 = 0_p \text{ and } \sigma > 0\}$ 
and $\{\theta_0 \in \Theta[k,p] \text{ and } \sigma > 0,\ \|X\theta_0\|_n^2 \ge n\rho^2_{F,U}[\Phi_\alpha,k,X]\sigma^2\}$ are well separated by the test $\Phi_\alpha$. 
Taking the infimum over all level-$\alpha$ tests, we get the $(\alpha,\delta)$-minimax separation distance over $\Theta[k,p]$ 
with design $X$ and unknown variance: 

$$ \rho^*_{F,U}[k,X] := \inf_{\Phi_\alpha} \rho_{F,U}[\Phi_\alpha,k,X] . \qquad (4.7) $$

Finally, $\rho^*_{F,U}[k] := \sup_X \rho^*_{F,U}[k,X]$ corresponds to the $(\alpha,\delta)$-minimax separation distance over 
$\Theta[k,p]$ with the "worst-case designs". 

In the Gaussian design, we define $\rho_{R,U}[\Phi_\alpha,k,\Sigma]$, $\rho^*_{R,U}[k,\Sigma]$, and $\rho^*_{R,U}[k]$ analogously to (4.6) 
and (4.7) by replacing the norm $\|X\theta_0\|_n/\sqrt{n}$ by $\|\sqrt{\Sigma}\theta_0\|_p$. 

4.2.2. Gaussian design 



Minimax bounds have been proved in [44] in the non ultra-high dimensional setting. The next 
theorem encompasses high dimensional and ultra-high dimensional settings. 

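Theorem 4.3. Suppose that $\alpha + \delta < 53\%$ and that $p \ge n \ge 8\log(2/\delta)$. For any $1 \le k \le \lfloor p^{1/3} \rfloor$, the 
$(\alpha,\delta)$-minimax separation distance over $\Theta[k,p]$ with covariance $I_p$ and unknown variance satisfies 

$$ \left(\rho^*_{R,U}[k,I_p]\right)^2 \ge C_1 \frac{k}{n}\log(p) \exp\left[C_2 \frac{k}{n}\log(p)\right] . \qquad (4.8) $$

For any $1 \le k \le n/2$ and any covariance $\Sigma$, we have 

$$ \left(\rho^*_{R,U}[k,\Sigma]\right)^2 \le C_1(\alpha,\delta) \frac{k}{n}\log\left(\frac{p}{k}\right) \exp\left[C_2(\alpha,\delta) \frac{k}{n}\log\left(\frac{p}{k}\right)\right] . \qquad (4.9) $$

Furthermore, this upper bound is simultaneously achieved for all $k$ and $\Sigma$ by a procedure $T_\alpha$ (defined 
in Section 10.1.2). 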

Remark 4.4. [Minimax adaptation] It follows from Theorem 4.3 that, under unknown variance, 
adaptation to the sparsity is possible and that the minimax separation distance $(\rho^*_{R,U}[k])^2$ over 
$\Theta[k,p]$ is of order 

$$ C_1(\alpha,\delta) \frac{k}{n}\log(p) \exp\left[C_2(\alpha,\delta) \frac{k}{n}\log(p)\right] . \qquad (4.10) $$

Remark 4.5. The condition $k \le p^{1/3}$ can be replaced by $k \le p^{1/2-\gamma}$ with $\gamma > 0$, the only difference 
being that the constants involved in (4.8) would depend on $\gamma$. These conditions are not really 
restrictive for a sparse high-dimensional regression since the usual setting is $k \le n \ll p$. 

Note that $k \le p^{1/3}$ implies that $\log(p) \le \frac{3}{2}\log(p/k) \le 3\log(p/k^2)$, so that we cannot distinguish terms 
$C_1\log(p)$ from $C_2\log(p/k^2)$ or $C_3\log(p/k)$. As a consequence, (4.10) does not necessarily capture 
the right dependency on $k$ in the logarithmic terms. This observation also holds for all the next 
results that require $k \le p^{1/3}$. 
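The chain of inequalities log(p) <= (3/2) log(p/k) <= 3 log(p/k^2) for k <= p^{1/3} can be checked numerically. The following Python sketch is only an illustration: the helper names `max_sparsity`, `log_terms` and `chain_holds` are hypothetical and the floating-point tolerance is an arbitrary choice.

```python
import math

def max_sparsity(p: int) -> int:
    """Largest integer k with k**3 <= p, i.e. floor(p^(1/3)) computed exactly."""
    k = 1
    while (k + 1) ** 3 <= p:
        k += 1
    return k

def log_terms(p: int, k: int):
    """Return log(p), (3/2) log(p/k) and 3 log(p/k^2)."""
    return math.log(p), 1.5 * math.log(p / k), 3.0 * math.log(p / k**2)

def chain_holds(p: int, k: int, tol: float = 1e-9) -> bool:
    """Check log(p) <= (3/2) log(p/k) <= 3 log(p/k^2) up to a small tolerance."""
    a, b, c = log_terms(p, k)
    return a <= b + tol and b <= c + tol

# The three terms coincide when k^3 = p (e.g. p = 1000, k = 10).
for p in (200, 1000, 5000):
    assert all(chain_holds(p, k) for k in range(1, max_sparsity(p) + 1))
```

The three logarithmic terms therefore differ by at most a factor 3 over the whole range of sparsities considered, which is why they cannot be distinguished up to constants.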

Remark 4.6. [Dependent design] As for the known variance case, we have $\rho^*_{R,U}[k,I_p] \ge 
C(\alpha,\delta)\rho^*_{R,U}[k]$, that is, the testing problem is more complex for an independent design than for 
a correlated design. For some covariance matrices $\Sigma$, the minimax separation distance with covariance 
$\Sigma$ is much smaller than $\rho^*_{R,U}[k,I_p]$. Verzelen and Villers [44] provide such an example of a 
matrix $\Sigma$ (see their Propositions 8 and 9). However, the arguments used in the proof of their example 
cannot be generalized to other covariances. In fact, the computation of sharp minimax bounds that 
capture the dependency of $\rho^*_{R,U}[k,\Sigma]$ on $\Sigma$ remains an open problem. 



4.2.3. Fixed design 

Ingster et al. [34] derive the asymptotic minimax separation distance for some specific designs when 
$k\log(p)/n$ goes to 0. Here, we provide the non-asymptotic counterpart that encompasses all the 
regimes. 

Proposition 4.4. Assume that $\alpha + \delta < 26\%$ and that $p \ge n \ge C(\alpha,\delta)$. For any $1 \le k \le \lfloor p^{1/3} \rfloor$, 
there exist some $n \times p$ designs $X$ such that 

$$ \left(\rho^*_{F,U}[k,X]\right)^2 \ge C_1 \frac{k}{n}\log(p) \exp\left[C_2 \frac{k}{n}\log(p)\right] . \qquad (4.11) $$

For any $1 \le k \le n/2$ and any $n \times p$ design $X$, we have 

$$ \left(\rho^*_{F,U}[k,X]\right)^2 \le C_1(\alpha,\delta) \frac{k}{n}\log\left(\frac{p}{k}\right) \exp\left[C_2(\alpha,\delta) \frac{k}{n}\log\left(\frac{p}{k}\right)\right] . \qquad (4.12) $$

Furthermore, this upper bound is simultaneously achieved for all $k$ and $X$ by a procedure $T_\alpha$ (defined 
in Section 10.1.2). 

Again, we observe a phenomenon analogous to the random design case. 

4.3. Comparison between known and unknown variance 

There are three regimes depending on $(k,p,n)$. They are depicted in Figure 2: 

1. $k\log(p) \le \sqrt{n}$. The minimax separation distances are of the same order for known and 
unknown $\sigma^2$. The minimax distance $k\log(p)/n$ is also of the same order as the minimax risk of 
prediction. 

2. $\sqrt{n} \le k\log(p) \le n$. If $\sigma^2$ is known, the minimax separation distance is always of order $1/\sqrt{n}$. 
In such a case, an optimal procedure amounts to testing the hypothesis $\{E[\|Y\|_n^2] = n\sigma^2\}$ against 
$\{E[\|Y\|_n^2] > n\sigma^2\}$ using the statistic $\|Y\|_n^2/\sigma^2$. If $\sigma^2$ is unknown, the statistic $\|Y\|_n^2/\sigma^2$ is 
not available and the minimax separation distance behaves like $k\log(p)/n$. 

3. $k\log(p) \ge n$. If $\sigma^2$ is unknown, the minimax separation distance blows up. It is of order 
$(p/k)^{Ck/n}$. Consequently, the problem of testing $\{\theta_0 = 0_p\}$ becomes extremely difficult in 
this setting. 

5. Prediction 

In contrast to the testing problem, the minimax risks of prediction (P2) exhibit really different 
behaviors in fixed and in random design. The big picture is summarized in Figure 1. We recall that 
the minimax risks $\mathcal{R}_F[k,X]$, $\mathcal{R}_F[k]$, $\mathcal{R}_R[k,\Sigma]$, and $\mathcal{R}_R[k]$ are defined in Section 3.1. 



5.1. Gaussian design 

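Proposition 5.1. [Minimax lower bound for prediction] Assume that $p \ge C$. For any 
$1 \le k \le \lfloor p^{1/3} \rfloor$, we have 

$$ \mathcal{R}_R[k,I_p] \ge C_1 \frac{k}{n}\log\left(\frac{p}{k}\right) \exp\left\{C_2 \frac{k}{n}\log\left(\frac{p}{k}\right)\right\} . \qquad (5.1) $$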

Remark 5.1. [General covariances $\Sigma$] The lower bound (5.1) is only stated for the identity 
covariance $\Sigma = I_p$. For general covariance matrices $\Sigma$, we have 

for any $k \le n \le p/2$. This statement has been proved in [42] (Proposition 4.5) in the special case 
of restricted isometry, but the proof straightforwardly extends to restricted eigenvalue conditions. 
For $\Sigma = I_p$, the lower bound (5.2) does not capture the elbow effect in an ultra-high dimensional 
setting (compare with (5.1)). 

Theorem 5.2. [Minimax upper bound] Assume that $n \ge C$. There exists an estimator $\hat\theta^V$ 
(defined in Section 10.2.1) such that the following holds: 

1. The computation of $\hat\theta^V$ does not require the knowledge of $\sigma^2$ or $k$. 

2. For any covariance $\Sigma$, any $\sigma > 0$, any $1 \le k \le \lfloor (n-1)/4 \rfloor$, and any $\theta_0 \in \Theta[k,p]$, we have 

$$ E_{\theta_0,\sigma}\left[\|\sqrt{\Sigma}(\hat\theta^V - \theta_0)\|_p^2\right] \le C_1 \frac{k}{n}\log\left(\frac{p}{k}\right) \exp\left(C_2 \frac{k}{n}\log\left(\frac{p}{k}\right)\right) \sigma^2 . \qquad (5.3) $$

In contrast to similar results such as Theorem 1 in Giraud [27] or Theorem 3.4 in Verzelen [42], 
we do not restrict $k$ to be smaller than $n/(2\log p)$, that is, we encompass both high-dimensional and 
ultra-high dimensional settings. The proof of the theorem is based on a new deviation inequality 
for the spectrum of Wishart matrices stated in Lemma 11.2. 

Remark 5.2. [Minimax risk] We derive from Theorem 5.2 and Proposition 5.1 that the minimax 
risk $\mathcal{R}_R[k]$ is of order 

$$ C_1 \frac{k}{n}\log\left(\frac{p}{k}\right) \exp\left[C_2 \frac{k}{n}\log\left(\frac{p}{k}\right)\right] . $$

If $k\log(p/k)$ is small compared to $n$, the minimax risk of estimation is of order $Ck\log(p/k)/n$. In 
an ultra-high dimensional setting, we again observe a blow-up. 

Remark 5.3. [Adaptation to sparsity and the variance] The estimator $\hat\theta^V$ does not require 
the knowledge of $k$ and of the variance $\sigma^2 = \mathrm{Var}(Y|X)$. It follows that $\hat\theta^V$ is minimax adaptive to 
all $1 \le k \le p^{1/3} \wedge \lfloor (n-1)/4 \rfloor$ and to all $\sigma^2 > 0$. As a consequence, adaptation to the sparsity and 
to the variance is possible for this problem. 

Remark 5.4. [Dependent design] The risk upper bound of $\hat\theta^V$ stated in Theorem 5.2 is valid for 
any covariance matrix $\Sigma$ of the covariates $X$. In contrast, the minimax lower bound of Theorem 
4.3 is restricted to the identity covariance. This implies that the minimax prediction risk for a 
general matrix $\Sigma$ is at worst of the same order as in the independent case: there exists a universal 
constant $C > 0$ such that for all covariances $\Sigma$, 

$$ \mathcal{R}_R[k,I_p] \ge C\,\mathcal{R}_R[k] . $$

In Remark 5.1, we have stated a minimax lower bound for prediction that depends on the 
restricted eigenvalues of $\Sigma$. Fix some $0 < \gamma < 1$. If we consider some covariance matrices $\Sigma$ such 
that $\Phi_{2k,-}(\sqrt{\Sigma})/\Phi_{2k,+}(\sqrt{\Sigma}) \ge 1 - \gamma$, the minimax lower bound (5.2) and the upper bound (5.3) 
match up to a constant $C(\gamma)$. In general, the lower bound (5.2) and the upper bound (5.3) do not 
exhibit the same dependency with respect to $\Sigma$, especially when $\Phi_{2k,-}(\sqrt{\Sigma})/\Phi_{2k,+}(\sqrt{\Sigma})$ is close to 
zero. 

5.2. Fixed design 

5.2.1. Known variance 

The minimax prediction risk with known variance has been studied in Raskutti et al. [39] and 
Rigollet and Tsybakov [40] (see also [1, 47]). For any design $X$ and any $1 \le k \le n$, these authors 
have proved that the minimax risk $\mathcal{R}_F[k,X]$ satisfies 

$$ C_1 \sup_{s \le k} \frac{\Phi_{2s,-}(X)}{\Phi_{2s,+}(X)} \frac{s}{n}\log\left(\frac{ep}{s}\right) \le \mathcal{R}_F[k,X] \le C_2\left[\frac{k}{n}\log\left(\frac{ep}{k}\right) \wedge 1\right] . \qquad (5.4) $$

Next, we bound the supremum $\sup_X \mathcal{R}_F[k,X]$ and we study the possibility of adaptation to the 
sparsity. 

Proposition 5.3. For any $1 \le k \le n$, the supremum $\sup_X \mathcal{R}_F[k,X]$ is lower bounded as follows 

$$ \mathcal{R}_F[k] \ge C\left[\frac{k}{n}\log\left(\frac{p}{k}\right) \wedge 1\right] . \qquad (5.5) $$

Assume that $p \ge n$. There exists an estimator $\hat\theta^{BM}$ (defined in Section 10.2.2) which satisfies 

$$ \sup_X \sup_{\theta_0 \in \Theta[k,p]} E_{\theta_0,\sigma}\left[\|X(\hat\theta^{BM} - \theta_0)\|_n^2/(n\sigma^2)\right] \le C\left[\frac{k}{n}\log\left(\frac{p}{k}\right) \wedge 1\right] \qquad (5.6) $$

for any $1 \le k \le n$. 

The upper bound (5.6) is a consequence of Birgé and Massart [11]. 

Remark 5.5. If $k\log(p/k)$ is small compared to $n$, the minimax risk is of order $Ck\log(p/k)/n$. In 
an ultra-high dimensional setting, this minimax risk remains close to one. This corresponds (up to 
renormalization) to the minimax risk of estimation of the vector $E[Y]$ of size $n$. As a consequence, 
the sparsity assumption no longer plays a role in an ultra-high dimensional setting. From 
(5.6), we derive that adaptation to the sparsity is possible when the variance $\sigma^2$ is known. 

Remark 5.6. [Dependency of $\mathcal{R}_F[k,X]$ on $X$] For designs $X$ such that the ratio $\Phi_{2k,-}(X)/\Phi_{2k,+}(X)$ 
is close to one, the lower and upper bounds of (5.4) agree with each other. This is for 
instance the case of the realizations (with high probability) of a standard Gaussian independent design 
(see the proof of Proposition 5.3 for more details). 

However, the dependency of the minimax lower bound in (5.4) on $X$ is not sharp when the 
ratio $\Phi_{2k,-}(X)/\Phi_{2k,+}(X)$ is away from one. Take for instance an orthogonal design with $p = n$ 
and duplicate the last column. Then, the lower bound (5.4) for this new design $X$ is 0 while the 
minimax risk is of order $k\log(p/k)/n$. 

Similarly, the dependency of the minimax upper bound in (5.4) on $X$ is not sharp. For very 
specific designs, it is possible to obtain a minimax risk $\mathcal{R}_F[k,X]$ that is much smaller than $\frac{k}{n}\log(p/k) \wedge 1$ 
(see Abramovich and Grinshtein [1]). 

Remark 5.7. [Comparison with $\ell_1$ procedures] The designs $X$ for which $\ell_1$ procedures such 
as the Lasso or the Dantzig selector are proved to perform well require that $\Phi_{2k,-}(X)/\Phi_{2k,+}(X)$ is 
close to one. It is interesting to notice that these designs $X$ precisely correspond to situations where 
the minimax risk is close to its maximum $k\log(p/k)/n$ (see Equation (5.4)). We refer to [39] for 
a more complete discussion. 

Remark 5.8. We easily retrieve from (5.4) a result of asymptotic geometry first observed by 
Baraniuk et al. [4] in the special case of the restricted isometry property [14]. For any $0 < \delta < 1$, there 
exists a constant $C(\delta) > 0$ such that no $n \times p$ matrix $X$ can fulfill $\Phi_{2k,-}(X)/\Phi_{2k,+}(X) \ge \delta$ if 
$k(1 + \log(p/k)) \ge C(\delta)n$. 

Proof. If $\Phi_{2k,-}(X)/\Phi_{2k,+}(X) \ge \delta$, then $\mathcal{R}_F[k,X] \ge C\delta k\log(ep/k)/n$. 

We also have $\mathcal{R}_F[k,X] \le \mathcal{R}_F[p,X] \le 1$. The last inequality follows from the risk of an estimator 
$\hat\theta^n \in \arg\min_{\theta \in \mathbb{R}^p} \|Y - X\theta\|_n^2$. Gathering these two bounds allows us to conclude. □ 

5.2.2. Unknown variance 

We now consider the problem of prediction when the variance $\sigma^2$ is unknown. 

Proposition 5.4. For any $1 \le k \le n$, there exists an estimator $\hat\theta^{(k)}$ that does not require the 
knowledge of $\sigma^2$ such that 

$$ \sup_X \sup_{\sigma > 0} \sup_{\theta_0 \in \Theta[k,p]} E_{\theta_0,\sigma}\left[\|X(\hat\theta^{(k)} - \theta_0)\|_n^2/(n\sigma^2)\right] \le C\left[\frac{k}{n}\log\left(\frac{p}{k}\right) \wedge 1\right] . \qquad (5.7) $$

Thus, the optimal risk of prediction over $\Theta[k,p]$ remains of the same order for known and 
unknown $\sigma^2$. 

Let us now study to what extent adaptation to the sparsity is possible when the variance $\sigma^2$ 
is unknown. In order to get some ideas, let us provide risk bounds for two procedures that do 
not require the knowledge of $\sigma$: the estimator $\hat\theta^V$ already studied for Gaussian design (defined in 
Section 10.2.1) and the estimator $\hat\theta^n$ defined by $\hat\theta^n \in \arg\min_{\theta \in \mathbb{R}^p} \|Y - X\theta\|_n^2$. 

Proposition 5.5. [Risk bound for $\hat\theta^V$ and $\hat\theta^n$] Assume that $n \ge 14$. For any $1 \le k \le \lfloor (n-1)/4 \rfloor$, 
the maximal risk of $\hat\theta^V$ over $\Theta[k,p]$ is upper bounded as follows 

$$ \sup_X \sup_{\sigma > 0} \sup_{\theta_0 \in \Theta[k,p]} E_{\theta_0,\sigma}\left[\|X(\hat\theta^V - \theta_0)\|_n^2/(n\sigma^2)\right] \le C_1 \frac{k}{n}\log\left(\frac{ep}{k}\right) \exp\left[C_2 \frac{k}{n}\log\left(\frac{ep}{k}\right)\right] . \qquad (5.8) $$

For any $1 \le k \le n$, the maximal risk of $\hat\theta^n$ over $\Theta[k,p]$ is upper bounded as follows 

$$ \sup_X \sup_{\sigma > 0} \sup_{\theta_0 \in \Theta[k,p]} E_{\theta_0,\sigma}\left[\|X(\hat\theta^n - \theta_0)\|_n^2/(n\sigma^2)\right] \le 1 . \qquad (5.9) $$

The risk bound (5.8) is also satisfied by the procedure of Baraud et al. [6]. The proof of (5.8) is 
a consequence of one of their results. 

Remark 5.9. As a consequence, $\hat\theta^V$ simultaneously achieves the minimax risk over all $\Theta[k,p]$ 
for all $k \le \lfloor (n-1)/4 \rfloor$ such that $k(1 + \log(p/k)) \le n$. In an ultra-high dimensional setting, the 
maximum risk of $\hat\theta^V$ over $\Theta[k,p]$ is controlled by $(ep/k)^{Ck/n}$ while the minimax risk is smaller 
than 1. If the upper bound (5.8) is sharp, then this would imply that $\hat\theta^V$ is not adaptive to the 
sparsity in an ultra-high dimensional setting. 

In contrast, $\hat\theta^n$ is minimax adaptive over all $\Theta[k,p]$ such that $k(1 + \log(p/k)) \ge n$, but its behavior 
is suboptimal in a non-ultra-high dimensional setting. 

In order to get an estimator that is adaptive to all indexes $k$, we would need to merge the 
properties of $\hat\theta^V$ (for non-ultra-high dimensional cases) and of $\hat\theta^n$ (for ultra-high dimensional cases). 
The following proposition tells us that this is in fact impossible. 

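Proposition 5.6. [Adaptation to the sparsity is impossible under unknown variance] 
Consider any $p \ge n \ge C_1$ and $1 \le k \le \lfloor p^{1/3} \rfloor$ such that $k\log(ep/k) \ge C_2 n$. There exists a design 
$X$ of size $n \times p$ such that for any estimator $\hat\theta$, we have either 

$$ \sup_{\sigma > 0} E_{0_p,\sigma}\left[\|X(\hat\theta - 0_p)\|_n^2/(n\sigma^2)\right] > C $$

or 

$$ \sup_{\sigma > 0} \sup_{\theta_0 \in \Theta[k,p]} E_{\theta_0,\sigma}\left[\|X(\hat\theta - \theta_0)\|_n^2/(n\sigma^2)\right] > C_3 \frac{k}{n}\log\left(\frac{p}{k}\right) \exp\left[C_4 \frac{k}{n}\log\left(\frac{p}{k}\right)\right] . $$

As a benchmark, we recall the minimax upper bounds: 

$$ \mathcal{R}_F[1] \le C_1 \frac{\log(p)}{n} \quad \text{and} \quad \mathcal{R}_F[k] \le C_2\left[\frac{k}{n}\log\left(\frac{p}{k}\right) \wedge 1\right] . $$

The proof of Proposition 5.6 is based on the minimax lower bounds (4.11) for the testing problem 
(P1) under unknown variance. The proof uses designs $X$ that are realizations of standard Gaussian 
designs. 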

Remark 5.10. In the setup of Proposition 5.6, any estimator $\hat\theta$ that does not require the knowledge 
of $k$ and $\sigma^2$ has to pay at least one of these two prices: 

1. The estimator $\hat\theta$ does not use the sparsity of the true parameter $\theta_0$. Its risk for estimating $0_p$ 
is of the same order as the minimax risk over $\mathbb{R}^p$. The estimator $\hat\theta^n$ has this drawback. 

2. For any $1 \le k \le p^{1/3}$, we have 

$$ \sup_X \sup_{\sigma > 0} \sup_{\theta_0 \in \Theta[k,p]} E_{\theta_0,\sigma}\left[\|X(\hat\theta - \theta_0)\|_n^2/(n\sigma^2)\right] \ge C_1 \frac{k}{n}\log\left(\frac{p}{k}\right) \exp\left[C_2 \frac{k}{n}\log\left(\frac{p}{k}\right)\right] . $$

This is the price for adaptation when $\sigma^2$ is unknown. The estimator $\hat\theta^V$ exhibits this behavior. 

As a conclusion, it is impossible to merge the qualities of $\hat\theta^V$ and of $\hat\theta^n$. 

The best prediction risk that can be achieved by a procedure that aims at adaptation to the sparsity 
is of order 

$$ \frac{k}{n}\log\left(\frac{p}{k}\right) \exp\left[C \frac{k}{n}\log\left(\frac{p}{k}\right)\right] . $$

In other words, the unavoidable loss for adaptation under unknown variance is a factor $\exp[C\frac{k}{n}\log(p/k)]$. 
In this sense, the estimator $\hat\theta^V$ (and as a byproduct the procedure of Baraud et al. [6]) achieves the 
optimal prediction risk under unknown variance and unknown sparsity. 



In conclusion, the minimax risks of prediction are of the same order for fixed and Gaussian 
design and for known and unknown variance when $k\log(p/k)$ is small compared to $n$. In an 
ultra-high dimensional setting, the minimax risks behave differently. For Gaussian design, the minimax 
risk is of the order $(p/k)^{Ck/n}$. In contrast, the minimax risk of prediction remains smaller than one 
for fixed design regression with known variance. When the sparsity and the variance are unknown, 
there is a price to pay for adaptation under fixed design. All these behaviors are depicted in Figure 
1. 
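To visualize the contrast between the two designs, the orders of magnitude above can be evaluated numerically. The following Python sketch is an illustration under explicit assumptions: the function name `prediction_orders` is hypothetical, and all constants are set to 1, so only the qualitative behavior (blow-up for Gaussian design, saturation at one for fixed design with known variance) should be read from it.

```python
import math

def prediction_orders(k: int, n: int, p: int, c: float = 1.0):
    """Return (rate, gaussian, fixed_known) orders of the minimax prediction risk.

    rate        = (k/n) log(p/k)
    gaussian    = rate * exp(c * rate)   (Gaussian design; constants set to c)
    fixed_known = min(rate, 1)           (fixed design, known variance)
    """
    rate = (k / n) * math.log(p / k)
    gaussian = rate * math.exp(c * rate)
    fixed_known = min(rate, 1.0)
    return rate, gaussian, fixed_known

# Illustrative values with n = 50 and p = 5000.
for k in (1, 5, 10, 15):
    rate, gaussian, fixed_known = prediction_orders(k, 50, 5000)
    print(f"k={k:2d}  rate={rate:.3f}  gaussian={gaussian:.3f}  fixed={fixed_known:.3f}")
```

When the rate (k/n) log(p/k) is well below one, the three quantities are close; once it exceeds one, the Gaussian-design order grows exponentially while the fixed-design order stays at one.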



6. Inverse problem and support estimation 
6.1. Minimax risk of estimation 

We recall that the minimax risks of estimation for the inverse problem $\mathcal{RI}_F[k,X]$, $\mathcal{RI}_F[k]$, $\mathcal{RI}_R[k,\Sigma]$, 
and $\mathcal{RI}_R[k]$ have been defined in Section 3.3. 



6.1.1. Fixed design 



First, we consider the problem (P3) for a fixed design regression model. The minimax risk of 
estimation over $\Theta[k,p]$ with a design $X$ is denoted $\mathcal{RI}_F[k,X]$ and is defined in (3.8). Raskutti et 
al. [39] have recently provided the following bounds 

$$ C_1 \frac{k\log(ep/k)}{\Phi_{2k \wedge p,+}(X)} \le \mathcal{RI}_F[k,X] \le C_2 \frac{k\log(ep/k)}{\Phi_{2k \wedge p,-}(X)} , \qquad (6.1) $$

which hold for any fixed design $X$ and any $1 \le k \le n$. The lower and upper bounds match up 
to the factor $\Phi_{2k \wedge p,+}(X)/\Phi_{2k \wedge p,-}(X)$. The upper bound is achieved by the least-squares estimator 
over $\Theta[k,p]$ [39]. If the restricted eigenvalues of $X$ are close to one, then the minimax risk is of 
order $k\log(ep/k)$. Next, we improve the lower bound in (6.1) in order to grasp the behavior of the 
minimax risk for non-orthogonal designs. 



Proposition 6.1. For any design $X$ and any $1 \le k \le n$, we have 

$$ \mathcal{RI}_F[k,X] \ge C\left[\frac{1}{\Phi_{2k \wedge p,-}(X)} \vee \frac{k\log(ep/k)}{\Phi_{1,+}(X)}\right] . \qquad (6.2) $$



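In order to interpret these bounds, let us restrict ourselves to designs $X$ such that each column 
has norm $\sqrt{n}$, as justified in Section 3.3. The collection of such designs is denoted $\mathcal{D}_{n,p}$. Observe that 
$X \in \mathcal{D}_{n,p}$ enforces $\Phi_{1,+}(X) = n$. 

In the sequel, we are interested in the smallest minimax risk $\mathcal{RI}_F[k,X]$ that is achievable if we 
can choose the $n \times p$ design $X \in \mathcal{D}_{n,p}$, that is, we want to bound $\mathcal{RI}_F[k] = \inf_{X \in \mathcal{D}_{n,p}} \mathcal{RI}_F[k,X]$. 
The minimax risk $\mathcal{RI}_F[k]$ tells us the intrinsic difficulty of estimating a $k$-sparse vector of size $p$ 
with $n$ observations. 

Proposition 6.2. 

1. Assume that $k[1 + \log(p/k)] \le Cn$. Then, we have 

$$ C_1 \frac{k}{n}\log\left(\frac{ep}{k}\right) \le \mathcal{RI}_F[k] \le C_2 \frac{k}{n}\log\left(\frac{ep}{k}\right) . \qquad (6.3) $$

This bound is for instance achieved for designs $X$ that are realizations (with high probability) 
of a normalized standard Gaussian design. 

2. For any design $X \in \mathcal{D}_{n,p}$ and any $k \le n \wedge p/2$, we have 

$$ \Phi_{2k,-}(X) \le C_1 n \left(\frac{p}{k}\right)^{-C_2 k/n} . \qquad (6.4) $$

3. For any $k \le n/4 \wedge p/2$, we have 

$$ C_1\left[\frac{k}{n}\log\left(\frac{p}{k}\right) \vee \frac{1}{n}\exp\left\{C_2 \frac{k}{n}\log\left(\frac{p}{k}\right)\right\}\right] \le \mathcal{RI}_F[k] \le C_3 \frac{k}{n}\log\left(\frac{p}{k}\right) \exp\left[C_4 \frac{k}{n}\log\left(\frac{p}{k}\right)\right] . \qquad (6.5) $$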

Remark 6.1. The bound (6.3) tells us that the best minimax risk that is achievable in a 
non-ultra-high dimensional setting is of order $k\log(ep/k)/n$. The Lasso achieves the (almost optimal) risk 
bound $k\log(p)/n$ under some assumptions on the design matrix. 

Remark 6.2. The lower bound (6.4) is of geometric nature. Combined with (6.2), it implies the 
lower bound of (6.5). In an ultra-high dimensional setting, it is not possible to build a design $X$ such 
that $\Phi_{2k,+}(X)/\Phi_{2k,-}(X)$ is close to one (see Remark 5.8). In fact, the quantity $\Phi^{-1}_{2k,-}(X)$ blows up 
because of geometric constraints. When $k[1 + \log(p/k)]$ is large compared to $n\log(n)$, both bounds 
in (6.5) are comparable and the minimax risk is of order $\exp[C\frac{k}{n}\log(p/k)]$. As a consequence, 
the inverse problem becomes extremely difficult in an ultra-high dimensional setting. 

Remark 6.3. While the quantity $k\log(p/k)$ in (6.3) is due to the "size" of the parameter space 
$\Theta[k,p]$, the exponential term of the minimax risk in ultra-high dimension is essentially driven by 
geometrical constraints on the design $X$. 

Proposition 6.3 (Adaptation to the sparsity and the variance). As in the prediction case, we 
consider the estimator $\hat\theta^V$ (defined in Section 10.2.1). Assume that $p \ge 2n$. For any design $X$, any 
$\sigma > 0$, any $1 \le k \le \lfloor (n-1)/4 \rfloor$, and any $\theta_0 \in \Theta[k,p]$, we have 



" Cl ^(x) log (f 



n k i ( e P 



(6.6) 



with probability larger than 1 — e n — C'/p. 

Remark 6.4. Although the bound (6.6) is in probability and not in expectation, it suggests that 
adaptation to the sparsity and to the variance are possible. 

6.1.2. Random design 

Let us turn to the Gaussian design case. We are interested in bounding $\mathcal{RI}_R[k,\Sigma]$ and $\mathcal{RI}_R[k]$ as 
defined in (3.9). 

Proposition 6.4. For any $1 \le k \le (n-1)/4$ and any covariance $\Sigma$, we have 



1 



k log(ep/fc) 



r , k log (ep k) 
< m R [k, E] < C 2 — —7=- exp 



C-. 



k (ep\ 



(6.7) 

As long as $k[1 + \log(p/k)] \le n$, we derive that $\mathcal{RI}_R[k] := \inf_{\Sigma \in S_p} \mathcal{RI}_R[k,\Sigma]$ satisfies 

$$ C_1 \frac{k}{n}\log\left(\frac{p}{k}\right) \le \mathcal{RI}_R[k] \le C_2 \frac{k}{n}\log\left(\frac{p}{k}\right) . \qquad (6.8) $$

We observe that $\mathcal{RI}_R[k]$ and $\mathcal{RI}_F[k]$ behave similarly in a non-ultra-high dimensional setting. 

Remark 6.5. [Ultra-high dimensional case] Proposition 6.4 does not allow us to derive the order 
of magnitude of $\mathcal{RI}_R[k]$ in an ultra-high dimensional setting. While the upper bound in (6.7) 
is blowing up, the lower bound remains as small as $k\log(p/k)/n$. Nevertheless, we know from 
Proposition 5.1 that 



This suggests that $\mathcal{RI}_R[k]$ is blowing up in an ultra-high dimensional setting, but the problem 
remains open. 

In the next proposition, we state the counterpart of Proposition 6.3 in the random design case. 

Proposition 6.5 (Adaptation to the sparsity and the variance). As in the prediction case, we 
consider the estimator $\hat\theta^V$ (defined in Section 10.2.1). Assume that $p \ge 2n$. For any covariance $\Sigma$, 
any $\sigma > 0$, any $1 \le k \le \lfloor (n-1)/12 \rfloor$, and any $\theta_0 \in \Theta[k,p]$, we have 



with probability larger than 1 — e n — C/p. 

6.2. Consequences on support estimation 

We deduce from the minimax lower bounds for the inverse problem (P3) some consequences for 
the support estimation problem (P4) in an ultra-high dimensional setting. The case $k[1 + \log(p/k)]$ 
small compared to $n$ has been studied in Wainwright [45]. 

Definition 6.1. For any $\rho > 0$ and any $k \le p$, the set $\mathcal{C}_k^p(\rho)$ is made of all vectors $\theta$ in $\Theta[k,p]$ 
such that $\theta$ contains exactly $k$ non-zero coefficients that are all equal to $\rho/\sqrt{k}$. 

In a non-ultra-high dimensional setting, Wainwright [46] has proved that, under suitable 
conditions on a design $X \in \mathcal{D}_{n,p}$, it is possible to recover the support of any vector $\theta_0$ that belongs to 
$\mathcal{C}_k^p(\rho)$ with $\rho$ of order $\sqrt{k\log(p)/n}\,\sigma$. Here, we prove that $\rho$ has to be much larger in an ultra-high 
dimensional setting. 

Proposition 6.6. [Support recovery is almost impossible] For any 
$\rho^2 \le \frac{C_1}{n}\left(\frac{p}{k}\right)^{C_2 k/n}$ and any $k \le n \wedge p/2$, we have 

$$ \inf_{X \in \mathcal{D}_{n,p}} \inf_{\hat m} \sup_{\theta_0 \in \mathcal{C}_k^p(\rho)} P_{\theta_0,1}\left[\hat m \neq \mathrm{supp}(\theta_0)\right] \ge 1/(2e+1) . \qquad (6.9) $$

For any design $X \in \mathcal{D}_{n,p}$, it is not possible to recover the support of $\theta_0$ with high probability, 
unless $\theta_0$ satisfies: 



This quantity is blowing up in an ultra-high dimensional setting and it can be much larger than 
the usual $k\log(p)/n$ that can be achieved in a non-ultra-high dimensional setting. 

As it is almost impossible to estimate the support of $\theta_0$ in an ultra-high dimensional setting, 
we may aim at an easier objective. Can we choose a subset $\hat M$ of $\{1,\ldots,p\}$ of size $p_0 < p$ that 
contains the support of $\theta_0$ with high probability? This would allow us to reduce the dimension of the 
problem from $p$ to $p_0$. Dimension reduction techniques are popular for analyzing high dimensional 
problems. We study here to what extent dimension reduction is a realistic objective: how large 
should the non-zero components of $\theta_0$ be? How small can we choose $p_0$? 



Proposition 6.7. Consider a Gaussian design regression with $\Sigma = I_p$ and $\sigma^2 = 1$. We assume 
that $p \ge k^3 \vee C$ and $n \ge C$. Set 



P 2 = log ) i 



n k l ( ep 
C2 n l0 HT 



There exists a universal constant $0 < \delta < 1$ such that for any measurable subset $\hat M$ of $\{1,\ldots,p\}$ 
of size $p_0 \le p^\delta$, we have 

$$ \sup_{\theta_0 \in \mathcal{C}_k^p(\rho)} P_{\theta_0,1}\left[\mathrm{supp}(\theta_0) \not\subset \hat M\right] \ge 1/8 . \qquad (6.10) $$

In an ultra-high dimensional setting, it is therefore not possible to reduce the dimension of the 
problem to $p^\delta$ unless the square norm of $\theta_0$ is of order $\exp[C\frac{k}{n}\log(p)]\sigma^2$. In (6.10), the number 
1/8 is of no particular significance. It can be replaced by any constant $c \in (0,1)$ if we take an 
asymptotic point of view ($(k,p,n) \to \infty$). 

Remark 6.6. In Proposition 6.7, we have taken the maximal risk point of view. If we put a 
uniform prior $\pi$ on $\mathcal{C}_k^p(\rho)$, it is possible to replace (6.10) by 



9 0> i {supp(0 o ) g M j 



> C 



where C is a positive constant. 



Remark 6.7. In order to shed light on the problem of dimension reduction, let us consider a 
simple asymptotic example: $p_n = \exp(n^{\gamma_1})$ and $k_n = n^{1 - (\gamma_1 \wedge 1) + \gamma_2}$ with $\gamma_1 > 0$ and $\gamma_2 > 0$. If we 
assume that $\theta_n \in \Theta[k_n,p_n]$ is such that $\|\theta_n\|_p^2 \le \exp(Cn^{\gamma_2 + (\gamma_1 - 1)_+})$, then it is not possible to find 
a subset $\hat M_n$ of size $\exp(\delta n^{\gamma_1})$ that contains the support of $\theta_n$ with probability going to one, where 
$\delta$ is defined as in Proposition 6.7. Consequently, we still have to keep at least $\exp(\delta n^{\gamma_1})$ variables 
after the process of dimension reduction if we do not want to forget relevant variables! 



7. What is an ultra-high dimensional problem? 

Until now, we have stated that a problem is ultra-high dimensional when $k\log(p/k)$ is large
compared to $n$. It has been proved that in such a setting, estimation of $\theta_0$, support estimation and
even dimension reduction become almost impossible. In this section, we numerically illustrate this
phase transition phenomenon. This allows us to quantify on specific examples how large
$k\log(p/k)/n$ should be for the phase transition to occur.

First simulation setting. Following the example described in the introduction, we consider a
Gaussian design linear regression model with $p = 5000$ or $p = 200$, $n = 50$, $\Sigma = I_p$, and $\sigma = 1$.
The number $k$ of non-zero components ranges from 1 to 15. For fixed $k$, we take $\theta_0$
such that $(\theta_0)_1 = \ldots = (\theta_0)_k = 4\sqrt{\log(p)/n} \approx 1.30$ (resp. 1.65) for $p = 200$ (resp. $p = 5000$)
and $(\theta_0)_{k+1} = \ldots = (\theta_0)_p = 0$. As a consequence, we have $\|\theta_0\|_p^2 = 16k\log(p)/n$. The non-zero
coefficients of $\theta_0$ are chosen large enough so that the support of $\theta_0$ is recoverable when the problem
is not ultra-high dimensional. Each experiment is repeated N = 100 times.
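
The data-generating step of this first setting can be sketched as follows. This is only an illustration under stated assumptions: the helper names `make_theta` and `simulate` are hypothetical, the design has i.i.d. standard Gaussian entries ($\Sigma = I_p$), and only the Python standard library is used.

```python
import math
import random

def make_theta(p, n, k, amplitude=4.0):
    """k-sparse vector whose first k entries equal amplitude * sqrt(log(p) / n)."""
    value = amplitude * math.sqrt(math.log(p) / n)
    return [value] * k + [0.0] * (p - k)

def simulate(p, n, k, rng, sigma=1.0):
    """Draw (Y, X, theta0) from Y = X theta0 + eps with an i.i.d. N(0, 1) design."""
    theta0 = make_theta(p, n, k)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # Only the first k coordinates of theta0 are non-zero, so the sum stops at k.
    Y = [sum(row[j] * theta0[j] for j in range(k)) + rng.gauss(0.0, sigma)
         for row in X]
    return Y, X, theta0

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
Y, X, theta0 = simulate(p=200, n=50, k=5, rng=rng)
print(theta0[0])                    # non-zero coefficient for p = 200
print(make_theta(5000, 50, 1)[0])   # non-zero coefficient for p = 5000
print(sum(t * t for t in theta0))   # squared norm, 16 k log(p) / n
```

The two printed coefficients should agree with the rounded values 1.30 and 1.65 quoted above.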






Dimension reduction procedures. We apply the SIS method [25] to reduce the dimension to a
set $\widehat M_S$ of size $p_0 = 50$. We then compute the power of the procedure,

$$\mathrm{Power} := \frac{\mathrm{Card}\left[\widehat M_S \cap \{1, \ldots, k\}\right]}{k}.$$

The power measures whether the dimension reduction has been performed efficiently.
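
A minimal sketch of the screening step and of the power computation is given below. It assumes that SIS ranks the covariates by the absolute marginal correlation between each normalized column of X and Y and keeps the $p_0$ best ones; the function names `sis` and `power` are hypothetical, and indices are 0-based, so the true support is {0, ..., k-1}.

```python
import math
import random

def sis(Y, X, p0):
    """Keep the p0 columns of X with the largest absolute marginal correlation with Y."""
    n, p = len(X), len(X[0])
    scores = []
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        norm = math.sqrt(sum(c * c for c in col))  # column normalization
        scores.append(abs(sum(c * y for c, y in zip(col, Y))) / norm)
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return set(ranked[:p0])

def power(selected, k):
    """Fraction of the true support {0, ..., k-1} contained in the selected set."""
    return len(selected & set(range(k))) / k

# Small illustration with the coefficients of the first simulation setting.
rng = random.Random(1)  # arbitrary seed
p, n, k, p0 = 200, 50, 3, 50
value = 4.0 * math.sqrt(math.log(p) / n)
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
Y = [sum(row[j] * value for j in range(k)) + rng.gauss(0.0, 1.0) for row in X]
print(power(sis(Y, X, p0), k))
```

Averaging the printed power over repeated draws gives the kind of curve summarized in Figure 4.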

We also compute the regularization path of the Lasso using the LARS [24] algorithm. Before
applying the Lasso, each column of X is normalized. We consider the set $\widehat M_L$ made of the $p_0$
covariates occurring first in the regularization path. We do not argue that SIS and the Lasso are
the best methods here. We have chosen them because they are classical and easy to implement.




Figure 4. Power of the dimension reduction procedures (SIS and Lasso) as a function of the sparsity k.



Results. The results are presented in Figure 4. When k is small, the dimension reduction problem
is not ultra-high dimensional and the Lasso and the SIS methods keep all the relevant covariates.
For large k, both methods miss some of the relevant covariates. For p = 5000, there is a clear
decrease in the power beyond k = 4. For p = 5000 and k = 8, both methods only have a power
close to 0.5: in expectation, only four of the eight relevant covariates belong to the sets
$\widehat M_S$ and $\widehat M_L$ of size 50. For p = 200, the transition is less clear, but the power
decreases slowly for k > 8. If there was no elbow effect in the minimax risk of estimation, then
it would still be possible to recover the support of $\theta_0$ with high probability. Indeed, each
non-zero component of $\theta_0$ is larger than $4\sqrt{\log(p)/n}$,
which is detectable in a reasonable setting (see e.g. [46]). For instance, for k = 6 and p = 5000,
$\|\theta_0\|_p^2/\sigma^2 = 16k\log(p)/n \approx 16.4$. Here, the elbow effect implies that even for a huge
signal-to-noise ratio, it is impossible to reduce the dimension of the problem without forgetting
relevant variables.

Second simulation setting. We still take $p = 5000$, $n = 50$, $\Sigma = I_p$, $\sigma = 1$, and k ranging from
1 to 5. For fixed k, we take $\theta_0$ such that $(\theta_0)_1 = \ldots = (\theta_0)_k = u\sqrt{\log(p)/n}$ and $(\theta_0)_{k+1} = \ldots =
(\theta_0)_p = 0$. Relying on N = 100 experiments, we estimate $u_k^*$, the smallest u such that $\widehat M_L$ has
a power larger than 0.9. $u_k^*$ corresponds (up to the renormalization $\sqrt{\log(p)/n}$) to the minimal
intensity of the signal so that the dimension reduction method does not forget relevant covariates.

Results. The results are presented in Figure 5. For small k, $u_k^*$ remains close to $\sqrt{2}$. In contrast,
we observe that $u_k^*$ blows up at k = 5. We have not depicted $u_6^*$, but we have $u_6^* > 100$. These two
simulation studies confirm that when k becomes large (in comparison to p and n), the dimension
reduction problem becomes extremely difficult.

Figure 5. Minimal signal $u_k^*$ as a function of the sparsity k.



Remark 7.1 (Rule of thumb). From these simulations and from other theoretical arguments (e.g.
[27, 22, 45]), we derive a simple rule of thumb. We say that a problem is ultra-high dimensional if

$$\frac{k\log(p/k)}{n} \ge \frac{1}{2}. \qquad (7.1)$$

For p = 5000 and n = 50, this corresponds to $k \ge 4$. Setting p = 200 and n = 50 yields $k \ge 8$.
In practice, we do not know k in advance. Nevertheless, criterion (7.1) helps us to know what
is the largest sparsity index such that the statistical problem remains reasonably difficult in the
minimax sense.
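
As a quick numerical illustration of criterion (7.1), the following sketch computes the smallest sparsity k at which the rule of thumb flags a problem as ultra-high dimensional. The function names are hypothetical and introduced only for this example.

```python
import math

def is_ultra_high_dimensional(k, p, n):
    """Rule of thumb (7.1): k * log(p / k) / n >= 1/2."""
    return k * math.log(p / k) / n >= 0.5

def smallest_ultra_high_k(p, n):
    """Smallest sparsity k in {1, ..., p} satisfying (7.1), or None if there is none."""
    for k in range(1, p + 1):
        if is_ultra_high_dimensional(k, p, n):
            return k
    return None

print(smallest_ultra_high_k(5000, 50))  # 4, as stated in Remark 7.1
print(smallest_ultra_high_k(200, 50))   # 8, as stated in Remark 7.1
```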

8. Discussion 



As stated in Sections 4-6, the behaviors of the minimax separation distances and of the minimax
risks become really different in an ultra-high dimensional setting. Apart from the test problem
($P_1$) with known variance and the problem of prediction ($P_2$) with fixed design, all the other
separation distances and minimax risks blow up when $k\log(p/k)$ becomes larger than n.

This elbow effect has important practical implications: there is no hope of selecting the relevant
covariates in an ultra-high dimensional setting, except if the signal-to-noise ratio is exponentially
large. Moreover, even dimension reduction techniques cannot work well in such a setting.

In linear testing ($P_1$), we have proved that the optimal separation distances highly depend on
the knowledge of the variance. Most of the testing procedures in the literature rely on the
knowledge of $\sigma^2$. Some specific work is therefore needed to derive fast and efficient procedures under
unknown variance (but see [34] for a procedure in a specific situation).

We have not discussed so far the problem of variance estimation. From the minimax lower
bounds of testing, we deduce the following lower bound.

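Proposition 8.1. Assume that $p \ge n \ge C$. For any $1 \le k \le p^{1/3}$, there exist designs X such that

$$\inf_{\widehat\sigma} \sup_{\sigma > 0,\ \theta_0 \in \Theta[k,p]} \mathbb{E}_{\theta_0,\sigma}\left[\left|\frac{\widehat\sigma^2}{\sigma^2} - \frac{\sigma^2}{\widehat\sigma^2}\right|\right] \ge C_1 \frac{k}{n} \log\left(\frac{p}{k}\right) \exp\left[C_2 \frac{k}{n} \log\left(\frac{p}{k}\right)\right].$$

As a consequence, the problem of variance estimation becomes extremely difficult in an
ultra-high dimensional setting.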
In Propositions 5.3 and 6.1, we have provided minimax lower bounds for ($P_2$) and ($P_3$) over
$\Theta[k,p]$ for arbitrary designs X. Our corresponding upper bounds match these lower bounds when
the restricted eigenvalues of $X^TX$ are close to each other. However, these bounds no longer agree
when these restricted eigenvalues are away from each other. Deriving the exact dependency
of the minimax risks on X would require sharper lower bounds and the analysis of new estimation
procedures.

Our minimax results use the Gaussianity of the noise $\epsilon$ and the Gaussianity of the design X in
the random design setting. In an ultra-high dimensional setting, the minimax upper bounds do not
seem to be robust with respect to the Gaussianity. In smaller dimensions ($k[1 + \log(p/k)] < n$),
the Gaussian distribution of the design is less critical. For instance, consider a design X where all
the components are independent and follow a subgaussian distribution. By a result of Rudelson
and Vershynin [41], the restricted eigenvalues of $X^TX$ remain away from 0 with high probability.
Consequently, some of the minimax bounds should still hold for subgaussian designs. Nevertheless,
the derivation of sharp minimax bounds for non-Gaussian designs and noises remains an open
problem.

9. Proofs of the minimax lower bounds 

Some propositions contain both minimax lower bounds and upper bounds. This section is devoted
to the proof of the main lower bounds, while the upper bounds are proved in Appendix B in [43].
In order to keep our notations as short as possible, we set

$$\eta = 2(1 - \alpha - \delta).$$

We also note $\|\cdot\|_{TV}$ for the total variation norm. For any subset $T \subset \mathbb{R}^p$, $\alpha \in (0,1)$, covariance
matrix $\Sigma$, and any variance $\sigma^2$, we denote $\beta_{\Sigma,\sigma,\alpha}(T)$ the quantity

$$\beta_{\Sigma,\sigma,\alpha}(T) := \inf_{\Phi_\alpha} \sup_{\theta_0 \in T} \mathbb{P}_{\theta_0,\sigma}[\Phi_\alpha = 0],$$

the infimum being taken over all tests $\Phi_\alpha$ satisfying $\mathbb{P}_{0_p,\sigma}[\Phi_\alpha = 1] \le \alpha$. Its counterpart for unknown
variance is defined by

$$\beta_{\Sigma,\alpha}(T) := \inf_{\Phi_\alpha} \sup_{\sigma > 0,\ \theta_0 \in T} \mathbb{P}_{\theta_0,\sigma}[\Phi_\alpha = 0],$$

the infimum being taken over all tests $\Phi_\alpha$ satisfying $\sup_{\sigma>0} \mathbb{P}_{0_p,\sigma}[\Phi_\alpha = 1] \le \alpha$. Similarly, we define
$\beta_{X,\sigma,\alpha}(T)$ for fixed design and $\beta_{X,\alpha}(T)$ for fixed design and unknown variance.

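Most of the minimax lower bounds in this paper are based on an approach which goes back to
Ingster [28, 29, 30]. The following lemma encompasses fixed and random design and known and
unknown variance.

Lemma 9.1. Let $T$ be a subset of $\mathbb{R}^p \setminus \{0_p\}$ and let $\sigma$ and $\sigma_0$ be two positive numbers. Consider $\mu$
a probability measure on $\sigma T := \{\sigma\theta,\ \theta \in T\}$. We note $\mathbb{P}_{\mu,\sigma} = \int \mathbb{P}_{\theta,\sigma}\,d\mu$ and $L_\mu = d\mathbb{P}_{\mu,\sigma}/d\mathbb{P}_{0_p,\sigma_0}$.
Then,

$$\beta_\alpha(T) \ge 1 - \alpha - \frac{1}{2}\left\|\mathbb{P}_{\mu,\sigma} - \mathbb{P}_{0_p,\sigma_0}\right\|_{TV} \ge 1 - \alpha - \frac{1}{2}\left(\mathbb{E}_{0_p,\sigma_0}\left[L_\mu^2(Y,X)\right] - 1\right)^{1/2}. \qquad (9.1)$$

Here, $\beta_\alpha(T)$ can be replaced by $\beta_{X,\alpha}(T)$ or $\beta_{\Sigma,\alpha}(T)$. If we also have $\sigma = \sigma_0$, then $\beta_\alpha(T)$ can be
replaced by $\beta_{X,\sigma_0,\alpha}(T)$ or $\beta_{\Sigma,\sigma_0,\alpha}(T)$.

We refer to Baraud [5] Section 7.1 for a proof and further explanations in a close framework. The
main idea is to find a prior probability on $T$ so that the total variation distance between $\mathbb{P}_{\mu,\sigma}$ and
$\mathbb{P}_{0_p,\sigma_0}$ is as small as possible, since a small distance yields a large lower bound in (9.1). We derive
from Lemma 9.1 that $\beta_\alpha(T) \ge \delta$ if $\mathbb{E}_{0_p,\sigma_0}[L_\mu^2(Y,X)] \le 1 + \eta^2$.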
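
The last implication is a one-line computation: with $\eta = 2(1-\alpha-\delta)$, the right-hand side of (9.1) evaluated at a second moment equal to $1+\eta^2$ is exactly $\delta$. The sketch below checks this numerically; the function names and the numerical values of $\alpha$ and $\delta$ are illustrative assumptions.

```python
import math

def eta(alpha, delta):
    """eta = 2 (1 - alpha - delta), as defined at the beginning of this section."""
    return 2.0 * (1.0 - alpha - delta)

def type2_lower_bound(alpha, second_moment):
    """Right-hand side of (9.1): 1 - alpha - 0.5 * sqrt(E[L^2] - 1)."""
    return 1.0 - alpha - 0.5 * math.sqrt(second_moment - 1.0)

alpha, delta = 0.05, 0.2            # illustrative levels, not taken from the paper
e = eta(alpha, delta)
print(type2_lower_bound(alpha, 1.0 + e ** 2))  # equals delta up to rounding
print(eta(0.25, 0.28))                         # alpha + delta = 53% gives eta = 0.94
```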
9.1. Proof of the lower bound (4.1) in Theorem 4.1

Proof of Theorem 4.1. By homogeneity, we can assume that $\sigma^2 = \mathrm{Var}(Y|X) = 1$. We first build a
suitable prior probability $\mu_\rho$ in order to apply Lemma 9.1.

Let us take a set $\widehat m$ of size k uniformly in $\mathcal{M}(k,p)$ (defined in Section 2). Let $\xi = (\xi_j)_{1 \le j \le p}$ be
a sequence of independent Rademacher random variables. Consider some $\rho > 0$. Define $\lambda = \rho/\sqrt{k}$
and consider $\mu_\rho$ the distribution of the random variable $\theta_{\widehat m,\xi} = \sum_{j \in \widehat m} \lambda \xi_j e_j$. $\mathbb{P}_{\mu_\rho,1}$ stands for the
distribution of (Y, X) with $\theta_0 \sim \mu_\rho$ and $\sigma = 1$. Here, $(e_j)_{1 \le j \le p}$ is the orthonormal family of vectors
of $\mathbb{R}^p$ defined by

$(e_j)_i = 1$ if $i = j$ and $(e_j)_i = 0$ otherwise.

The likelihood ratio $L_{\mu_\rho}(X,Y) = d\mathbb{P}_{\mu_\rho,1}/d\mathbb{P}_{0_p,1}$ writes

$$L_{\mu_\rho}(X,Y) = \mathbb{E}_{\widehat m,\xi}\left[\exp\left(-\frac{\|Y - X\theta_{\widehat m,\xi}\|_n^2 - \|Y\|_n^2}{2}\right)\right],$$

where $\mathbb{E}_{\widehat m,\xi}$ stands for the expectation with respect to the distribution of $\xi$ and $\widehat m$.



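In order to apply Lemma 9.1, we need to upper bound the expectation of $L_{\mu_\rho}^2(X,Y)$. Let us
first take the expectation of $L_{\mu_\rho}^2(X,Y)$ with respect to Y.

$$\mathbb{E}_{0_p,1}\left[L_{\mu_\rho}^2(X,Y)\right] = \mathbb{E}_X\left[\mathbb{E}_{m_1,m_2,\xi^{(1)},\xi^{(2)}}\left\{\mathbb{E}_Y\exp\left(-\frac{\|X\theta_{m_1,\xi^{(1)}}\|_n^2 + \|X\theta_{m_2,\xi^{(2)}}\|_n^2}{2} + \left\langle Y, X\left(\theta_{m_1,\xi^{(1)}} + \theta_{m_2,\xi^{(2)}}\right)\right\rangle_n\right)\right\}\right]$$
$$= \mathbb{E}_X\left[\mathbb{E}_{m_1,m_2,\xi^{(1)},\xi^{(2)}}\left\{\exp\left(\left\langle X\theta_{m_1,\xi^{(1)}}, X\theta_{m_2,\xi^{(2)}}\right\rangle_n\right)\right\}\right], \qquad (9.2)$$

where $\mathbb{E}_X$ stands for the expectation with respect to X while $\mathbb{E}_{m_1,m_2,\xi^{(1)},\xi^{(2)}}$ refers to the
expectation with respect to the independent variables $\xi^{(1)}$, $\xi^{(2)}$, $m_1$ and $m_2$.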



Lemma 9.2. If we assume that p 2 < C log (l 



, then we have 



E, 



0pA {Ll p (Y,X)|x} 



< 1 + V 



In this lemma, we have specifically distinguished the integration with respect to X from the
integration with respect to Y. This will be useful for deriving the minimax lower bound in fixed design
(Proposition 4.2). Gathering Lemmas 9.1 and 9.2 allows us to derive that



(p* R [k,I P ]) 2 >C 



log 1 



k 2 



1 



This last bound allows us to conclude since $p \ge n^2$. □

Proof of Lemma 9.2. Let us fix $m_1$, $m_2$, $\xi^{(1)}$ and $\xi^{(2)}$. First, we shall compute the expectation
$\mathbb{E}_X[\exp(\langle X\theta_{m_1,\xi^{(1)}}, X\theta_{m_2,\xi^{(2)}}\rangle_n)]$.

Let us decompose the set $m_1 \cup m_2$ into four sets (which possibly are empty): $m_1 \setminus m_2$, $m_2 \setminus m_1$,
$m_3$, and $m_4$, where $m_3$ and $m_4$ are defined by $m_3 := \{j \in m_1 \cap m_2 \mid \xi_j^{(1)} = \xi_j^{(2)}\}$ and
$m_4 := \{j \in m_1 \cap m_2 \mid \xi_j^{(1)} = -\xi_j^{(2)}\}$. For the sake of simplicity, we reorder the elements of $m_1 \cup m_2$ from 1 to
$|m_1 \cup m_2|$ such that the first elements belong to $m_1 \setminus m_2$, then to $m_2 \setminus m_1$ and so on.



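$$\mathbb{E}_X\left[\exp\left(\left\langle X\theta_{m_1,\xi^{(1)}}, X\theta_{m_2,\xi^{(2)}}\right\rangle_n\right)\right] = \left[\int_{\mathbb{R}^p} (2\pi)^{-p/2} \exp\left(-\sum_{i=1}^{p}\frac{x_i^2}{2} + \frac{\lambda^2}{2}\sum_{1 \le i,j \le |m_1 \cup m_2|} C[i,j]\,x_i x_j\right) dx\right]^n = \left|I_{|m_1 \cup m_2|} - \lambda^2 C\right|^{-n/2},$$

where $I_{|m_1 \cup m_2|}$ is the identity matrix of size $|m_1 \cup m_2|$ and C is a block symmetric matrix of size
$|m_1 \cup m_2|$ defined by

$$C = \begin{pmatrix} 0 & 1 & 1 & -1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 2 & 0 \\ -1 & 1 & 0 & -2 \end{pmatrix},$$

where each entry denotes a block with constant coefficients.
Each block corresponds to one of the four previously defined subsets of $m_1 \cup m_2$ (i.e. $m_1 \setminus m_2$,
$m_2 \setminus m_1$, $m_3$, and $m_4$). The matrix C is of rank at most four. Hence, $I_{|m_1 \cup m_2|} - \lambda^2 C$ has the same
determinant as the matrix D of size 4 defined by: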



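$$D = \begin{pmatrix}
1 & -\lambda^2|m_2 \setminus m_1| & -\lambda^2|m_3| & \lambda^2|m_4| \\
-\lambda^2|m_1 \setminus m_2| & 1 & -\lambda^2|m_3| & -\lambda^2|m_4| \\
-\lambda^2|m_1 \setminus m_2| & -\lambda^2|m_2 \setminus m_1| & 1 - 2\lambda^2|m_3| & 0 \\
\lambda^2|m_1 \setminus m_2| & -\lambda^2|m_2 \setminus m_1| & 0 & 1 + 2\lambda^2|m_4|
\end{pmatrix}.$$

After some computations, we lower bound the determinant of D:

$$|D| \ge 1 - 2\left(2|m_3| - |m_1 \cap m_2|\right)\lambda^2 - 8\rho^4.$$

From now on, we assume that $\rho^2 \le 1/20$ so that $|D| \ge 1/2$. Hence, we get

$$\mathbb{E}_X\left[\exp\left(\left\langle X\theta_{m_1,\xi^{(1)}}, X\theta_{m_2,\xi^{(2)}}\right\rangle_n\right)\right] \le \left[1 - 2\left(2|m_3| - |m_1 \cap m_2|\right)\lambda^2 - 8\rho^4\right]^{-n/2}$$
$$\le \exp\left(8n\rho^4\right)\exp\left[2n\lambda^2\left(2|m_3| - |m_1 \cap m_2|\right)\right]. \qquad (9.3)$$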



Then, we take the expectation with respect to $\xi^{(1)}$, $\xi^{(2)}$, $m_1$ and $m_2$. When $m_1$ and $m_2$ are fixed,
the expression (9.3) depends on $\xi^{(1)}$ and $\xi^{(2)}$ only through the cardinality of $m_3$. As $\xi^{(1)}$ and $\xi^{(2)}$
follow independent Rademacher distributions, the random variable $2|m_3| - |m_1 \cap m_2|$ follows the
distribution of Z, a sum of $|m_1 \cap m_2|$ independent Rademacher variables, and

$$\mathbb{E}_X\left[\mathbb{E}_{0_p,1}\left\{L_{\mu_\rho}^2(Y,X) \mid X\right\}\right] \le \exp\left(8n\rho^4\right)\mathbb{E}_Z\left[\exp\left(2n\lambda^2 Z\right)\right], \qquad (9.4)$$

where $\mathbb{E}_Z$ stands for the expectation with respect to Z. We now proceed as in the proof of Theorem
1 in Baraud [5] in order to upper bound the term

$$\mathbb{E}_Z\left[\exp\left(2n\lambda^2 Z\right)\right] = \mathbb{E}_{m_1,m_2}\left[\cosh\left(2n\lambda^2\right)^{|m_1 \cap m_2|}\right].$$
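
The identity behind this display is that, conditionally on $|m_1 \cap m_2| = m$, the Laplace transform of a sum of m independent Rademacher variables is $\cosh(t)^m$. The following sketch checks it by exhaustive enumeration of the sign patterns; the function name and the numerical values are illustrative assumptions.

```python
import itertools
import math

def rademacher_sum_mgf(t, m):
    """E[exp(t * Z)] for Z a sum of m independent Rademacher variables,
    computed by enumerating the 2^m equally likely sign patterns."""
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=m):
        total += math.exp(t * sum(signs))
    return total / 2 ** m

t = 0.3  # plays the role of 2 n lambda^2; the value is arbitrary
for m in range(6):
    print(m, rademacher_sum_mgf(t, m), math.cosh(t) ** m)
```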



Following Baraud's arguments, we get that $\mathbb{E}_Z[\exp(2n\lambda^2 Z)] \le \sqrt{1 + \eta^2}$ when

Moreover, we have $\exp(8\rho^4 n) \le \sqrt{1 + \eta^2}$ as soon as $\rho^2 \le C/\sqrt{n}$ since $\eta \ge 0.94$. Gathering these



observations with (9.4), we conclude that Ex 

P 2 <C 



Eo pll {Lj p (Y,X) X} 



< 1 + r/ 2 as soon as 



□ 



9.2. Proof of the lower bound (4.8) in Theorem 4.3

Proof of (4.8) in Theorem 4.3. Consider the Condition

We deduce Theorem 4.3 from the following result. 
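Lemma 9.3. Suppose that $\alpha + \delta \le 53\%$. We have

$$\beta_{I_p,\alpha}\left(\left\{\theta_0 \in \Theta[k,p],\ \|\theta_0\|_p^2/\sigma^2 = \rho^2\right\}\right) \ge \delta, \qquad (9.5)$$

for any $\rho^2 > 0$ such that

$$\rho^2 \le \frac{k}{2n}\log\left(1 + \frac{p}{k^2}\right). \qquad (9.6)$$

If we assume that Condition (A.1) holds, (9.5) holds for any $\rho > 0$ such that

$$\rho^2 \le -1 + \left(\frac{p}{2ek}\right)^{k/n}(4k)^{-2/n}. \qquad (9.7)$$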

If $p \ge k^3 \vee C$ and $k\log(p)/n \ge C_1$ with C and $C_1$ large enough, then Assumption (A.1) is
satisfied. For C large enough, the quantity $k\log(p)/\log(k)$ is large enough so that the right-hand
side of (9.7) is larger than

$$-1 + \exp\left[C\frac{k}{n}\log(p)\right] \ge C_1\frac{k}{n}\log(p)\exp\left[C_2\frac{k}{n}\log(p)\right].$$

Let us now assume that $p \ge k^3 \vee C$ and $k\log(p)/n \le C_1$ where $C_1$ has been previously fixed. Then,
the right-hand side of the first bound (9.6) is larger than

$$C\frac{k}{n}\log(p)\exp\left[C_2\frac{k}{n}\log(p)\right].$$

Gathering the two previous lower bounds with Lemma 9.3 allows us to conclude. □



Proof of Lemma 9.3. Consider some $\rho > 0$. To apply Lemma 9.1, we first have to define a suitable
prior $\mu_\rho$ on $\theta_0$ and a suitable $\sigma^2$. More specifically, we set $\sigma^2 = (1 + \rho^2)^{-1}$ and the distribution $\mu_\rho$
is supported by $\Theta[k,p,\rho]$ defined by

$$\Theta[k,p,\rho] := \left\{\theta_0 \in \Theta[k,p],\ \|\theta_0\|_p^2 = \frac{\rho^2}{1 + \rho^2}\right\}.$$

Let $\widehat m$ be a random variable uniformly distributed over $\mathcal{M}(k,p)$. Let $\mu_\rho$ be the distribution of
the random variable $\theta_{\widehat m} = \lambda\sum_{j \in \widehat m} e_j$ where $\lambda := \rho/\sqrt{k(1 + \rho^2)}$
and where $(e_j)_{1 \le j \le p}$ is the orthonormal family of vectors of $\mathbb{R}^p$ defined by $(e_j)_i = 1$ if $i = j$ and
$(e_j)_i = 0$ otherwise. By Lemma 9.1, we only have to prove that, under condition (9.6) or under (9.7) with
(A.1), we have

$$\mathbb{E}_{0_p,1}\left(L_{\mu_\rho}^2(Y,X)\right) \le 1 + \eta^2. \qquad (9.8)$$

Observe here that we use a variance 1 for $H_0$ and a variance $1 - \|\theta_0\|_p^2$ for the hypothesis $H_1$. Using
these two different variances allows us to take advantage of the fact that we work under unknown
variance.

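As a specific case of [44] Eq.(8.5), we have

$$\mathbb{E}_{0_p,1}\left(L_{\mu_\rho}^2(Y,X)\right) = \mathbb{E}_{m_1,m_2 \in \mathcal{M}(k,p)}\left[\left(1 - \frac{\rho^2|m_1 \cap m_2|}{(1 + \rho^2)k}\right)^{-n}\right] = \mathbb{E}_Z\left[\left(1 - \frac{\rho^2 Z}{(1 + \rho^2)k}\right)^{-n}\right],$$

where Z follows a hypergeometric distribution with parameters p, k, and k/p. We know from
Aldous (p. 173) [2] that Z follows the same distribution as the random variable $\mathbb{E}(W \mid \mathcal{B}_p)$ where W
is a binomial random variable of parameters k, k/p and $\mathcal{B}_p$ some suitable $\sigma$-algebra. By a convexity
argument, we get

$$\mathbb{E}_Z\left[\left(1 - \frac{\rho^2 Z}{(1 + \rho^2)k}\right)^{-n}\right] \le \mathbb{E}_W\left[\left(1 - \frac{\rho^2 W}{(1 + \rho^2)k}\right)^{-n}\right]. \qquad (9.9)$$

Hence, we only need to upper bound the expectation of the second random variable.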
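
Inequality (9.9) can be checked numerically on a small example. The sketch below assumes that Z is the size of the intersection of two independent uniform subsets of size k of {1, ..., p} and that W is binomial with parameters k and k/p; the function names and the values of p, k, n and $\rho^2$ are illustrative.

```python
import math

def hypergeometric_pmf(p, k, j):
    """P(Z = j) where Z is the overlap of two independent uniform k-subsets of {1..p}."""
    return math.comb(k, j) * math.comb(p - k, k - j) / math.comb(p, k)

def binomial_pmf(k, q, j):
    """P(W = j) for W binomial with parameters k and q."""
    return math.comb(k, j) * q ** j * (1.0 - q) ** (k - j)

def convex_f(z, k, n, rho2):
    """The convex function z -> (1 - rho^2 z / ((1 + rho^2) k))^(-n) appearing in (9.9)."""
    return (1.0 - rho2 * z / ((1.0 + rho2) * k)) ** (-n)

def both_sides(p, k, n, rho2):
    """Left-hand and right-hand sides of (9.9)."""
    lhs = sum(hypergeometric_pmf(p, k, j) * convex_f(j, k, n, rho2) for j in range(k + 1))
    rhs = sum(binomial_pmf(k, k / p, j) * convex_f(j, k, n, rho2) for j in range(k + 1))
    return lhs, rhs

lhs, rhs = both_sides(p=50, k=5, n=20, rho2=1.0)
print(lhs, rhs)  # the first value should not exceed the second
```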
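CASE 1: Proof of Equation (9.6). Since $\log(1 + x) \le x$ and since $W \le k$, we have

$$\mathbb{E}_W\left[\left(1 - \frac{\rho^2 W}{(1 + \rho^2)k}\right)^{-n}\right] \le \mathbb{E}_W\left[\exp\left(\frac{n\rho^2 W/k}{1 + \rho^2 - \rho^2 W/k}\right)\right] \le \mathbb{E}_W\left[\exp\left(n\rho^2 W/k\right)\right]$$
$$\le \left[1 + \frac{k}{p}\left(e^{n\rho^2/k} - 1\right)\right]^k \le \exp\left[\frac{k^2}{p}\left(e^{n\rho^2/k} - 1\right)\right].$$

As a consequence, the condition (9.8) holds if $\rho^2 \le \frac{k}{n}\log\left[1 + \frac{p}{k^2}\log(1 + \eta^2)\right]$. Observe that
$\log(1 + \eta^2) \ge 0.6$. Since $\log(1 + ux) \ge u\log(1 + x)$ for any $0 < u < 1$ and any $x > 0$, the last condition is
enforced by $\rho^2 \le \frac{k}{2n}\log\left[1 + \frac{p}{k^2}\right]$.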

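CASE 2: Proof of Equation (9.7). Here, we bound (9.9) under condition (A.1). We have

$$\mathbb{E}_W\left[\left(1 - \frac{\rho^2 W}{(1 + \rho^2)k}\right)^{-n}\right] - 1 \le \sum_{i=1}^{k}\mathbb{P}[W \ge i]\left[1 - \frac{\rho^2 i}{(1 + \rho^2)k}\right]^{-n}.$$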

Since we need to ensure that Ejy[{l — p 2 W/((l + p 2 )k)}~ n — 1] < rj 2 , it is sufficient to prove that 

(9.10) 
(9.11) 



I 1- - 



2 '—% 

< for any 1 < i < [k/2\ , 



[W>i][l- 



p Z l 



< i- for any \k/2\ + 1 <i< k 



{1 + P 2 )k, 

In order to prove these bounds, we shall use a deviation inequality of the random variable W/k.

Lemma 9.4. For any $k \ge 1$ and $0 < x \le 1$, it holds that

$$\mathbb{P}\left[\frac{W}{k} \ge x\right] \le \left(\frac{k}{px}\right)^{kx}\left(\frac{1}{1 - x}\right)^{k(1 - x)}. \qquad (9.12)$$

FACT 1. For any $1 \le i \le \lfloor k/2 \rfloor$, the upper bounds (9.10) hold under Condition (A.1).
FACT 2. The upper bound (9.11) holds for any $\lfloor k/2 \rfloor + 1 \le i \le k$ as soon as

$$\rho^2 \le \left(\frac{p}{2ek}\right)^{k/n}\left(\frac{\eta^2}{2k}\right)^{2/n} - 1. \qquad (9.13)$$

We derive that under (9.13), we have $\mathbb{E}_{0_p,1}[L_{\mu_\rho}^2(Y,X)] \le 1 + \eta^2$. The fact that $\eta^2 \ge 1/2$ allows us to
conclude.

□

Proof of FACT 1. Since $\log(1 - x) \ge -x/(1 - x)$ for any $0 \le x < 1$, we derive that $(1 - x)^{1 - x} \ge e^{-x}$.
Gathering this bound with Lemma 9.4, we get a new deviation inequality for W:

$$\mathbb{P}\left[\frac{W}{k} \ge x\right] \le \left(\frac{ke}{px}\right)^{xk}, \qquad (9.14)$$

for any $x \le 1$. We apply this bound with $x = i/k$. Then, Inequality (9.10) holds if



P 



l/n 



< 1 - 



Taking the logarithm of this expression leads to 

1 , / . , ox i/k 



MJ?) + ^(W) +T 



-i/k 



<0, 



Since i is constrained to be smaller than fc/2, we get 



ik 
n 



-log +-log (4/r ] 2 )+2i<0 



By Assumption (A.1), $\frac{k}{n}\log[p/(ek^2)]$ is larger than 2. Consequently, the worst case among all i
between 1 and k/2 is i = 1. Hence, we only need to prove that

$$\frac{k}{n}\left[\log\left(\frac{p}{ek^2}\right) - \log\left(\frac{4}{\eta^2}\right)\right] \ge 2.$$

Since $\eta$ is larger than 0.94, $\log(4e/\eta^2)$ is smaller than 3 and this last inequality is ensured by
Assumption (A.1).

□

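Proof of FACT 2. We consider here the case $1/2 \le i/k \le 1$. We derive from (9.14) that

$$\mathbb{P}[W \ge i] \le \left(\frac{2ek}{p}\right)^{i}.$$

Consequently, we want to ensure that

$$\left(\frac{2ek}{p}\right)^{i}\left[1 - \frac{\rho^2 i}{(1 + \rho^2)k}\right]^{-n} \le \frac{\eta^2}{2k}$$

for any i between $\lfloor k/2 \rfloor$ and k. For any x and u between 0 and 1, $(1 - x)^u \le (1 - xu)$. Setting
$u = i/k$ and $x = \rho^2/(1 + \rho^2)$, we obtain that the last inequality holds if

$$1 - \frac{\rho^2}{1 + \rho^2} \ge \sup_{\lfloor k/2 \rfloor \le i \le k}\left(\frac{2ek}{p}\right)^{k/n}\left(\frac{2k}{\eta^2}\right)^{k/(in)}.$$

Since $2k/\eta^2$ is positive, the largest term in the bound corresponds to $i = k/2$. Hence, it remains to
prove that

$$\frac{1}{1 + \rho^2} \ge \left(\frac{2ek}{p}\right)^{k/n}\left(\frac{2k}{\eta^2}\right)^{2/n}.$$

We conclude that the upper bounds hold if

$$\rho^2 \le \left(\frac{p}{2ek}\right)^{k/n}\left(\frac{\eta^2}{2k}\right)^{2/n} - 1.$$

□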



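Proof of Lemma 9.4. We prove this deviation inequality using the Laplace transform of W/k.
Consider some $x \in (0,1)$ and $\lambda > 0$.

$$\log\mathbb{P}\left[\frac{W}{k} \ge x\right] \le -\lambda x + \log\left[\mathbb{E}_W\left\{\exp(\lambda W/k)\right\}\right] \le -\lambda x + k\log\left[1 + \frac{k}{p}\left(e^{\lambda/k} - 1\right)\right].$$

Deriving with respect to $\lambda$ an upper bound of the last expression leads to the following choice:
$e^{\lambda/k} = \frac{x}{1 - x}\left(\frac{p}{k} - 1\right)$. Hence, we get

$$\log\mathbb{P}\left[\frac{W}{k} \ge x\right] \le -kx\log\left[\frac{x}{1 - x}\left(\frac{p}{k} - 1\right)\right] + k\log\left[\frac{1 - k/p}{1 - x}\right].$$

Since we assume $x < 1$, we conclude that

$$\mathbb{P}\left[\frac{W}{k} \ge x\right] \le \left(\frac{k}{px}\right)^{kx}\left(\frac{1}{1 - x}\right)^{k(1 - x)}.$$

Since $\mathbb{P}(W = k) = [k/p]^k$, this upper bound is also valid when $x = 1$.

□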
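
The two deviation inequalities used in this subsection can be compared numerically with the exact binomial tail. The sketch below takes W binomial with parameters k and k/p and evaluates, at $x = i/k$, the bound of Lemma 9.4 and its simplified form $(ke/(px))^{xk}$ from (9.14); the function names and the values of k and p are illustrative assumptions.

```python
import math

def binomial_tail(k, q, i):
    """Exact P(W >= i) for W binomial with parameters k and q."""
    return sum(math.comb(k, j) * q ** j * (1.0 - q) ** (k - j) for j in range(i, k + 1))

def lemma_bound(k, p, x):
    """(k / (p x))^(k x) * (1 - x)^(-k (1 - x)), read as (k / p)^k at x = 1."""
    if x == 1.0:
        return (k / p) ** k
    return (k / (p * x)) ** (k * x) * (1.0 - x) ** (-k * (1.0 - x))

def simplified_bound(k, p, x):
    """(e k / (p x))^(k x), the weaker bound obtained from (1 - x)^(1 - x) >= exp(-x)."""
    return (math.e * k / (p * x)) ** (k * x)

k, p = 10, 1000  # illustrative sizes with k much smaller than p
for i in range(1, k + 1):
    x = i / k
    print(i, binomial_tail(k, k / p, i), lemma_bound(k, p, x), simplified_bound(k, p, x))
```

On each printed row the exact tail should not exceed the bound of Lemma 9.4, which in turn should not exceed the simplified bound.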



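9.3. Proof of Proposition 5.1

We derive this minimax lower bound from the hypothesis testing problem $\{\theta_0 = 0_p\}$ studied in
Section 4. Since the covariance $\Sigma = I_p$, the loss $\mathbb{E}[\{X(\theta_1 - \theta_2)\}^2/\sigma^2]$ is simply $\|\theta_1 - \theta_2\|_p^2/\sigma^2$. For
the sake of simplicity, we assume that p is even. We split the p covariates into two groups $M_1$ and
$M_2$ of size p/2. Given some $\rho > 0$, we fix $\sigma^2 = (1 + \rho^2)^{-1}$ and we consider the two sets

$$\Theta_1[\rho] := \left\{\theta : \mathrm{supp}(\theta) \subset M_1 \text{ and } \|\theta\|_p^2 = \frac{\rho^2}{1 + \rho^2}\right\},$$
$$\Theta_2[\rho] := \left\{\theta : \mathrm{supp}(\theta) \subset M_2 \text{ and } \|\theta\|_p^2 = \frac{\rho^2}{1 + \rho^2}\right\}.$$

Take any estimator $\widehat\theta$. We consider an estimator $\widetilde\theta \in \Theta_1[\rho] \cup \Theta_2[\rho]$ such that

$$\|\widetilde\theta - \widehat\theta\|_p = \inf_{\theta \in \Theta_1[\rho] \cup \Theta_2[\rho]}\|\theta - \widehat\theta\|_p.$$

By the triangle inequality, we have $\|\widetilde\theta - \theta_0\|_p \le 2\|\widehat\theta - \theta_0\|_p$ for any $\theta_0 \in \Theta_1[\rho] \cup \Theta_2[\rho]$.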



sup sup Eg 0jO . 
i=i,2e ee i [ / o] 



> >i_ sup sup P e0)CT [ SU pp(6i) £ M,-] 

4 i=l,2 e £8i[p] 



(9.15) 



It is enough to prove that for $\rho^2 = C_1\frac{k}{n}\log(p)\exp\{C_2\frac{k}{n}\log(p)\}$, the supremum of the
probabilities $\mathbb{P}_{\theta_0,\sigma}[\mathrm{supp}(\widetilde\theta) \not\subset M_i]$ is lower bounded by a positive constant. This is equivalent to lower
bounding the minimax separation distance for $H_0$: $\{\theta_0 \in \Theta_1[\rho]$ and $\sigma^2 = (1 + \rho^2)^{-1}\}$ against $H_1$:
$\{\theta_0 \in \Theta_2[\rho]$ and $\sigma^2 = (1 + \rho^2)^{-1}\}$.

As in the proof of Theorem 4.3, we build a prior distribution $\mu_{1,\rho}$ on $\Theta_1[\rho]$. Consider the collection
$\mathcal{M}_1(k)$ of subsets of $M_1$ of size k. Let $\widehat m$ be some random variable uniformly distributed over
$\mathcal{M}_1(k)$. Then, $\mu_{1,\rho}$ is the distribution of $\theta = \sum_{j \in \widehat m}\rho/\sqrt{k(1 + \rho^2)}\,e_j$. Similarly, we define the prior
distribution $\mu_{2,\rho}$ on $\Theta_2[\rho]$. We note $\mathbb{P}_{\mu_{i,\rho},\sigma} = \int\mathbb{P}_{\theta_0,\sigma}\,d\mu_{i,\rho}$. We have



sup sup P eo [supp(6>) £ > 1- 

i=l,2 6» eei[r] 



1 



- f P2, p ,<tIItv ■ 



2 II Mi,P) CT 

> l-|lPMi.p,<r-Po B ,l||TV 



(9.16) 



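by the triangle inequality. Lemma 9.1 states that

$$\left\|\mathbb{P}_{\mu_{1,\rho},\sigma} - \mathbb{P}_{0_p,1}\right\|_{TV} \le \left(\mathbb{E}_{0_p,1}\left[L_{\mu_{1,\rho}}^2\right] - 1\right)^{1/2},$$

where $L_{\mu_{1,\rho}} = d\mathbb{P}_{\mu_{1,\rho},\sigma}/d\mathbb{P}_{0_p,1}$. In fact, the second moment of $L_{\mu_{1,\rho}}$ has been studied in the proof
of Theorem 4.3. If we take $\alpha + \delta = 53\%$ in this proof, we derive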



En 



7-2 



< 1.9 



if 



< Cifclog(p)/nexp(C 2 fclog(p)/n) and if p > fc 3 V C 3 



1 -p 2 

Gathering this result with Equations (9.15) and (9.16) allows to conclude. 



9.4. Proof of Proposition 5.6

Let us set a = S = 0.01. Consider a design X that achieves the bound (4.11) and take p = 
p* F v \k, X]/2. If klog(p)/n is large enough, then p > Take any estimator 9 that does not rely 
on the variance a 2 . Let us build a test T of the hypotheses Ho: {9o = and a > 0} against 
Hi: {9 £ e[k,p] and a > 0, ||X0 o ||£/(r^ 2 ) > p 2 }: 




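$$T = \begin{cases} 0 & \text{if } 2\|X\widehat\theta\|_n^2 \le \|Y\|_n^2, \\ 1 & \text{if } 2\|X\widehat\theta\|_n^2 > \|Y\|_n^2. \end{cases}$$

By Proposition 4.4, we have at least one of the two following properties:

$$\sup_{\sigma > 0}\mathbb{P}_{0_p,\sigma}(T = 1) \ge \alpha, \qquad (9.17)$$

$$\sup_{\sigma > 0,\ \theta_0 \in \Theta[k,p],\ \|X\theta_0\|_n^2/(n\sigma^2) \ge \rho^2}\mathbb{P}_{\theta_0,\sigma}(T = 0) \ge \delta. \qquad (9.18)$$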
CASE 1: (9.17) holds. We have $\mathbb{P}_{0_p,\sigma}[\|Y\|_n^2 \ge n\sigma^2/2] \ge 1 - e^{-n/16}$ for any $\sigma > 0$. Thus, there
exists $\sigma > 0$ such that $\|X\widehat\theta\|_n^2 \ge n\sigma^2/4$ with probability larger than $\alpha/2 - e^{-n/16}$. As a consequence,
we have

$$\sup_{\sigma > 0}\mathbb{E}_{0_p,\sigma}\left[\|X(\widehat\theta - \theta_0)\|_n^2/(n\sigma^2)\right] \ge C.$$

CASE 2: (9.18) holds. The random variable $\|Y\|_n^2/\sigma^2$ follows a noncentral $\chi^2$ distribution with n
degrees of freedom and a noncentrality parameter $\|X\theta_0\|_n^2/\sigma^2$. By Lemma 1 in Birgé [9], we have
$\|Y\|_n^2 \le 3/2\left[n\sigma^2 + \|X\theta_0\|_n^2\right]$ with probability larger than $1 - e^{-Cn}$. Consequently, there exist
$\sigma > 0$ and $\theta_0 \in \Theta[k,p]$ such that $\|X\theta_0\|_n^2/(n\sigma^2) \ge \rho^2$ and


||X0|| 2 /(na 2 ) < \ [1 + ||X0 o || 2 /(na 2 )] < 7 -\\X9 \\ 2 J(na 2 ), 



with probability <5/2 



, since p 2 > 2. Thus, we get 



|X(0-0„)|l»/» 



> E fl 



|X0|| n -||X0, 



On 



/n 



2_2 



> C||X0 o ||;/n >c P v 



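9.5. Proof of Proposition 8.1

For the sake of conciseness, we note $l(\widehat\sigma,\sigma) = |\widehat\sigma^2/\sigma^2 - \sigma^2/\widehat\sigma^2|$. Given a positive number $\rho$, we note
$\sigma_0 = (1 + \rho^2)^{-1/2}$. As in the proof of Theorem 4.3, we consider the prior probability $\mu_\rho$ on $\Theta[k,p]$.
For any estimator $\widehat\sigma > 0$, we define $\widetilde\sigma$ by $\widetilde\sigma \in \arg\min_{\sigma \in \{1,\sigma_0\}} l(\widehat\sigma,\sigma)$. For any $\sigma \in \{1,\sigma_0\}$, the loss
$l(\widehat\sigma,\sigma)$ is controlled as follows:

$$l(\widehat\sigma,\sigma) \ge l(1,\sqrt{\sigma_0})\,\mathbf{1}_{\widetilde\sigma \ne \sigma}.$$

Thus, we get the minimax lower bound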

inf sup Eg 0<a [l(9,a)] 
a <r>o, e ee[k, P ] 

> inf max [E 0p ,i {1(9, 1)} ; E Mp . CTo {Z(<t, ct )}] 

(T>0 

> l(l,y/ao)_vrf max[P 0f> ,i[3r^l];P CTo p^(To]] 

cre{l,cr } 



> 



> 



i(i, V^) 



» P ,l\\TV 



2vT 



1 - 



l--(E 0p4 [L 2 D (Y,X)]-l 



1/2 



(9.19) 



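Let us note two numbers $\eta_1 = 1.5$ and $\eta_2 = 1.8$. If X is a standard Gaussian design and if $k \le p^{1/3}$,
then the proof of Theorem 4.3 states that for

$$\rho^2 \le C_1\frac{k}{n}\log\left(\frac{p}{k}\right)\exp\left[C_2\frac{k}{n}\log\left(\frac{p}{k}\right)\right],$$

we have $\mathbb{E}_{0_p,1}[L_{\mu_\rho}^2(Y,X)] \le 1 + \eta_1^2$ where the expectation is taken both with respect to Y and X.
Applying Markov's inequality, we derive that with positive probability,

$$\mathbb{E}_{0_p,1}\left[L_{\mu_\rho}^2(Y,X) \mid X\right] \le 1 + \eta_2^2.$$

For such designs X and such $\rho$ we have

$$\inf_{\widehat\sigma}\sup_{\sigma > 0,\ \theta_0 \in \Theta[k,p]}\mathbb{E}_{\theta_0,\sigma}\left[l(\widehat\sigma,\sigma)\right] \ge C\frac{\rho^2}{\sqrt{1 + \rho^2}} \ge C\left(\rho \wedge \rho^2\right),$$

since $\rho^2/\sqrt{1 + \rho^2} \ge (\rho \wedge \rho^2)/\sqrt{2}$. We conclude that

$$\inf_{\widehat\sigma}\sup_{\sigma > 0,\ \theta_0 \in \Theta[k,p]}\mathbb{E}_{\theta_0,\sigma}\left[l(\widehat\sigma,\sigma)\right] \ge C\frac{k}{n}\log\left(\frac{p}{k}\right)\exp\left[C_2\frac{k}{n}\log\left(\frac{p}{k}\right)\right].$$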



9.6. Fano's Lemma

The next lower bounds are established applying Birgé's version of Fano's Lemma [10]. More
precisely, we shall use the following lemma, which is taken from Corollary 2.19 in [37].

Lemma 9.5. Let $(S,d)$ be some pseudo-metric space, $\{\mathbb{P}_s, s \in S\}$ be some statistical model.
Let us note $\kappa = 2e/(2e + 1)$. Then, for any estimator $\widehat s$ and any finite subset $\mathcal{C}$ of S, setting
$\delta = \min_{s,t \in \mathcal{C},\ s \ne t} d(s,t)$, provided that $\max_{s,t \in \mathcal{C}}\mathcal{K}(\mathbb{P}_s,\mathbb{P}_t) \le \kappa\log|\mathcal{C}|$, the following lower bound holds
for any $p \ge 1$:

$$\sup_{s \in \mathcal{C}}\mathbb{E}_s\left[d^p(s,\widehat s)\right] \ge 2^{-p}\delta^p(1 - \kappa).$$

9.7. Proof of the lower bounds of Propositions 6.1 and 6.4

This lower bound is based on Fano's lemma. For the sake of simplicity, we assume that $2k \le p$
and that $\sigma^2 = 1$. First, we consider a unit vector $\theta \in \Theta[2k,p]$ such that $\|X\theta\|_n^2 = \Phi_{2k,-}(X)$. Let
us define $\kappa = 2e/(2e + 1)$. It is possible to find two vectors $(\theta_1,\theta_2) \in \Theta[k,p]$ such that $\theta_1 - \theta_2 =
\theta\sqrt{2\kappa\log(2)/\Phi_{2k,-}(X)}$ and $\mathrm{supp}(\theta_1) \cap \mathrm{supp}(\theta_2) = \emptyset$. Consequently, the Kullback distance $\mathcal{K}(\theta_1,\theta_2)$
between the two distributions $\mathbb{P}_{\theta_1}$ and $\mathbb{P}_{\theta_2}$ is exactly $\kappa\log(2)$ and $\|\theta_1 - \theta_2\|_p^2 = 2\kappa\log(2)/\Phi_{2k,-}(X)$.
Applying Lemma 9.5, we derive the first part of the lower bound.

Let us turn to the second part of the lower bound. We consider $\mathcal{M}(k,p)$ the collection of subsets
of $\{1, \ldots, p\}$ of size k. Applying combinatorial results such as Varshamov's lemma and Lemma 4.10
in [37], we derive that there exists $\mathcal{M}'(k,p) \subset \mathcal{M}(k,p)$ of size larger than $\exp[Ck\log(ep/k)]$ such
that for any pair of distinct sets $m_1$, $m_2$ in $\mathcal{M}'(k,p)$, we have $|m_1 \cap m_2| \le 3k/4$.
For any $m \in \mathcal{M}'(k,p)$, we define a vector $\theta_m$ that satisfies:

• $|(\theta_m)_i| = 1/\sqrt{k}$ if $i \in m$ and 0 else.
• $\|X\theta_m\|_n^2 \le \Phi_{1,+}(X)$.

Let us prove that this construction is possible by induction on k. The construction is straightforward
for k = 1. Assume that this construction is possible for k - 1. Let us take some subset $m \in \mathcal{M}(k,p)$
and $m' \subset m$ such that $|m'| = k - 1$. There exists a vector $\theta$ such that $\mathrm{supp}(\theta) = m'$, $|(\theta)_i| = 1/\sqrt{k}$
for any $i \in m'$ and $\|X\theta\|_n^2 \le \Phi_{1,+}(X)(k - 1)/k$. Now consider the two vectors $\theta_1$ and $\theta_2$ such that
$(\theta_1)_i = (\theta_2)_i = \theta_i$ if $i \in m'$, $(\theta_1)_i = -(\theta_2)_i = 1/\sqrt{k}$ if $i \in m \setminus m'$ and $(\theta_1)_i = (\theta_2)_i = 0$ else. It
follows that $\|X\theta_1\|_n^2 \le \Phi_{1,+}(X)$ or $\|X\theta_2\|_n^2 \le \Phi_{1,+}(X)$, which allows us to conclude.

For any $r > 0$, we consider the set $\mathcal{C}'_k[r] := \{r\theta_m,\ m \in \mathcal{M}'(k,p)\}$. The Kullback distance
between any two elements $\theta_1 \ne \theta_2$ in $\mathcal{C}'_k[r]$ is upper bounded as follows:

$$\mathcal{K}[\theta_1,\theta_2] = \frac{\|X(\theta_1 - \theta_2)\|_n^2}{2\sigma^2} \le 2\Phi_{1,+}(X)\frac{r^2}{\sigma^2},$$

while we have $\|\theta_1 - \theta_2\|_p^2 \ge r^2/2$. Applying Birgé's version of Fano's lemma [10] we conclude that:



inf sup ^o ,<j 

9 8a£Conv[Cl(Vkr)] 



\\X(0 - o )\\l/n 



> C 



r 2 A fc(l+lQg(p/fc)) j2 



*ifc,+ (X) 



where Conv[A] stands for the convex hull of A. Taking $r^2 = k[1 + \log(p/k)]\sigma^2/\Phi_{1,+}(X)$ allows us to
conclude.

The proof of the minimax lower bound (6.7) in Proposition 6.4 follows exactly the same steps.
The minimax lower bound (6.8) is a consequence of (6.7) and the fact that $\Phi_{1,+}(\sqrt{\Sigma}) = 1$ for any
$\Sigma \in \mathcal{S}_p$.



9.8. Proof of Proposition 6.2

Proof of the first result. First, the minimax lower bound is a straightforward consequence of
(6.2), since $\Phi_{1,+}(X) = n$ if $X \in \mathcal{D}_{n,p}$. Let us turn to the upper bound. Thanks to the minimax
upper bound (6.1), we only have to prove that there exists a design X such that its 2k-restricted
eigenvalues remain close to n.

Consider a standard Gaussian design W of size $n \times p$. Rescaling to a norm of $\sqrt{n}$ each column
of W, we get a design $X \in \mathcal{D}_{n,p}$. Let us assume that $k[1 + \log(p/k)] \le \{4(1 + \sqrt{2})\}^{-2}n$. Applying
Lemma 11.2, we control the restricted eigenvalues of W:

$$\Phi_{2k,+}(W/\sqrt{n}) \le (7/4)^2 \quad\text{and}\quad \Phi_{2k,-}(W/\sqrt{n}) \ge (1/4)^2,$$

with probability larger than $1 - \exp(-n/32)$. Consider any $\theta \in \Theta[2k,p]$ such that $\|\theta\|_p = 1$. By
definition of X, there exists some $\theta' \in \Theta[2k,p]$ such that $X\theta = W\theta'$. Moreover we have

$$\|\theta'\|_p^2 \ge \Phi_{1,+}^{-1}(W/\sqrt{n}).$$

Hence, we derive that

$$\Phi_{2k,-}(X) \ge \Phi_{2k,-}(W)\,\Phi_{1,+}^{-1}(W/\sqrt{n}).$$

Thus, we have $\Phi_{2k,-}(X) \ge n/49$ with positive probability.

Proof of the second result. Let X be a design in $\mathcal{D}_{n,p}$. Take $\delta \in (0,1]$. Let us consider the
collection $\mathcal{M}(k,p)$ (defined in Section 2). As explained in the proof of Proposition 6.1, there exists
$\mathcal{M}'(k,p) \subset \mathcal{M}(k,p)$ of size larger than $\exp[Ck\log(ep/k)]$ such that for any pair of distinct sets $m_1$,
$m_2$ in $\mathcal{M}'(k,p)$, we have $|m_1 \cap m_2| \le 3k/4$.

For any $m \in \mathcal{M}'(k,p)$, we define a vector $\theta_m$ such that $|(\theta_m)_i| = 1/\sqrt{k}$ if $i \in m$ and 0 else and
that $\|X\theta_m\|_n^2 \le n$. Such a construction is justified in the proof of Proposition 6.1.

For any $m_1 \ne m_2$ in $\mathcal{M}'(k,p)$, we have $\|\theta_{m_1} - \theta_{m_2}\|_p^2 \ge 1/2$. If there exist two distinct sets
$(m_1,m_2) \in \mathcal{M}'(k,p)$ such that $\|X(\theta_{m_1} - \theta_{m_2})\|_n^2 \le n\delta^2$, then the design X satisfies $\Phi_{2k,-}(X) \le
2n\delta^2$. A necessary condition for X to satisfy $\Phi_{2k,-}(X) \ge 2n\delta^2$ is therefore that the vectors $X\theta_m$
are $\sqrt{n}\delta$-separated.

If X satisfies $\Phi_{2k,-}(X) \ge 2n\delta^2$, then the balls in $\mathbb{R}^n$ with radius $\sqrt{n}\delta$ centered at $X\theta_m$ are all
disjoint. Thus, the sum of their volumes is smaller than the volume of a ball of radius $\sqrt{n}(1 + \delta)$
in $\mathbb{R}^n$. This implies that $\delta \le 2(k/(ep))^{Ck/n}$. Hence, for any design X with unit columns, we have

$$\Phi_{2k,-}(X) \le C_1 n\left(\frac{k}{p}\right)^{C_2 k/n},$$

which allows us to prove the second result.

Proof of the third result. The minimax lower bound is a direct consequence of (6.2) and (6.4). In
order to finish the proof, we shall combine the minimax upper bound (6.1) with an upper bound of
$\inf_{X \in \mathcal{D}_{n,p}}\Phi_{2k,-}^{-1}(X)$. Consider a standard Gaussian design X with size $n \times p$. Applying the deviation
inequality (11.3) of Lemma 11.2, we derive that with probability larger than $1 - 1/p$, we have



2A-. 









-log 


(§H 




n 





However, the design X does not belong to $\mathcal{D}_{n,p}$. This is why we consider $X' = XD^{-1}$, where D is a
diagonal matrix of size p, whose l-th diagonal element corresponds to the norm of the l-th column
of $X/\sqrt{n}$. Obviously, $X'$ belongs to $\mathcal{D}_{n,p}$.

$$\Phi_{2k,-}(X') = \inf_{\theta \in \Theta[2k,p]}\frac{\|X'\theta\|_n^2}{\|\theta\|_p^2} = \inf_{\theta \in \Theta[2k,p]}\frac{\|X\theta\|_n^2}{\|D\theta\|_p^2} \ge \frac{\Phi_{2k,-}(X)}{\varphi_{\max}(D^2)}.$$

Each diagonal element of $nD^2$ follows a $\chi^2$ distribution with n degrees of freedom. Applying
Lemma 11.1, we derive that $\varphi_{\max}(D) \le C\sqrt{1 \vee \log(p)/n}$ with probability larger than $1 - 1/p$. We
conclude that


< Cm 



p\C 2 k/n) 
k) 



log 



V 1 



IV 



log(p) 



< 



'p\ C 2 k / n ) 

Je) 

with probability larger than 1 — 2/p. This allows to conclude. 



9.9. Proof of Proposition 6.6

For the sake of simplicity, we assume that $\sigma^2 = 1$. Consider a design $\mathbf{X} \in \mathcal{D}_{n,p}$. By the proof of
Proposition 6.2, there exist two vectors $\theta_1$ and $\theta_2$ such that:

1. $\theta_1$ and $\theta_2$ contain exactly $k$ non-zero components which are all equal to $1/\sqrt{k}$ in absolute
value.

2. The Hamming distance between $\theta_1$ and $\theta_2$ is larger than $k/2$.

3. $\|\mathbf{X}(\theta_1 - \theta_2)\|_n^2 \leq C_1 n \exp[-C_2 (k/n) \log(ep/k)] := \rho^{*-2}$.

Let us set $\theta_1^* = C\rho^*\theta_1$ and $\theta_2^* = C\rho^*\theta_2$ with $C = 4\log(2)e/(2e+1)$. Consequently, the Kullback
discrepancy between $\mathbb{P}_{\theta_1^*}$ and $\mathbb{P}_{\theta_2^*}$ is smaller than $\log(2)2e/(2e+1)$. Consider an estimator $\widehat{\theta}$ taking
its values in $\{\theta_1^*, \theta_2^*\}$. Applying Corollary 2.18 in [37] (which is another version of Fano's Lemma),
we derive that $\inf_{\theta_0 \in \{\theta_1^*, \theta_2^*\}} \mathbb{P}_{\theta_0,1}(\widehat{\theta} = \theta_0) \leq 2e/(2e+1)$. This allows us to conclude.



9.10. Proof of Proposition 6.7

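For the sake of simplicity, we assume that $\sigma^2 = 1$ and that $p$ is even. Consider any estimator $\widehat{M}$ of
size $p_0$. We set

\[
\rho^2 = \frac{C_1 k}{2n} \log(p) \exp\left[\frac{C_2 k}{2n} \log(p)\right] , \tag{9.20}
\]

where the constants $C_1$, $C_2$ correspond to the ones used at the end of the proof of Proposition 5.1.
We also consider the set $\mathcal{C}_k^{\rho}(p)$. Suppose that we have

\[
\sup_{\theta_0 \in \mathcal{C}_k^{\rho}(p)} \mathbb{P}_{\theta_0,1}\left[\mathrm{supp}(\theta_0) \subset \widehat{M}\right] \geq 7/8 . \tag{9.21}
\]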






Assume we are given a second $n$-sample of $(Y, X)$ independent of the first one. We note $(\mathbf{Y}', \mathbf{X}')$
this new sample. We consider the estimator $\widetilde{\theta}_k$ defined by

\[
\widetilde{\theta}_k := \arg\min_{\theta \in \Theta[k,p] \text{ and } \mathrm{supp}(\theta) \subset \widehat{M}} \|\mathbf{Y}' - \mathbf{X}'\theta\|_n^2 .
\]

Since $\Sigma = I_p$, all the covariates that do not lie in the support of $\theta_0$ play a symmetric role in the
distribution of $(Y, X)$. This estimator $\widetilde{\theta}_k$ has the same form as the estimator $\widehat{\theta}_k$ introduced in
(10.5). Arguing as in the proof of Theorem 5.2, we derive that



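\[
\|\widetilde{\theta}_k - \theta_0\|_p^2 \, \mathbf{1}_{\mathrm{supp}(\theta_0) \subset \widehat{M}}
\leq C_1' \frac{k}{n} \log\left(\frac{p_0}{k}\right) \exp\left[C_2' \frac{k}{n} \log\left(\frac{p_0}{k}\right)\right]
\]

with probability larger than $7/8$. Gathering this bound with (9.21), we derive that for any
$\theta_0 \in \mathcal{C}_k^{\rho}(p)$, we have

\[
\|\widetilde{\theta}_k - \theta_0\|_p^2
\leq C_1' \frac{k}{n} \log\left(\frac{p_0}{k}\right) \exp\left[C_2' \frac{k}{n} \log\left(\frac{p_0}{k}\right)\right] \tag{9.22}
\]

with probability larger than $3/4$.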



We shall prove that (9.22) is impossible if $p_0$ is too small. Let us split the $p$ covariates into two
groups $M_1$ and $M_2$. We consider the subsets $\mathcal{C}_{k,1}^{\rho}(p)$ (resp. $\mathcal{C}_{k,2}^{\rho}(p)$) of $\mathcal{C}_k^{\rho}(p)$ whose elements have
their support in $M_1$ (resp. $M_2$). Arguing as in (9.15) and (9.16), we derive that for any estimator
$\widehat{\theta}$, there exists $\theta_0 \in \mathcal{C}_{k,1}^{\rho}(p) \cup \mathcal{C}_{k,2}^{\rho}(p)$ such that

\[
\|\widehat{\theta} - \theta_0\|_p^2 \geq \frac{\rho^2}{4} = \frac{C_1 k}{8n} \log(p) \exp\left[\frac{C_2 k}{2n} \log(p)\right]
\]

with probability larger than $1/4$. Here, the constants $C_1$ and $C_2$ are the same as in (9.20).

The last lower bound contradicts (9.22) if $\log(p_0)/\log(p) \leq \delta$, where $\delta > 0$ depends on the
relative values of $C_1$, $C_2$, $C_1'$, and $C_2'$ in (9.20) and (9.22).



10. Procedures involved in the proofs of the minimax upper bounds 
10.1. Testing procedures 
10.1.1. Known variance: test $T^*_\alpha$

In order to establish the minimax upper bounds for known variance, we consider the following
testing procedure. It is taken from Baraud [5] who applies it in the Gaussian sequence model. In
the sequel, $\bar{\chi}_k(u)$ denotes the probability for a $\chi^2$ distribution with $k$ degrees of freedom to be
larger than $u$. Given a subset $m$ of $\{1, \ldots, p\}$, $\Pi_m$ refers to the orthogonal projection onto the
space generated by the vectors $(\mathbf{X}_i)_{i \in m}$.

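Definition 10.1. [Procedure $T^*_\alpha$] Define $k^*$ as the smallest integer such that $k^*[1 + \log(p/k^*)] \geq \sqrt{n}$.
For any $1 \leq k \leq k^*$, we define the statistics $T^*_{\alpha,k}$ by

\[
T^*_{\alpha,k} := \sup_{m \in \mathcal{M}(k,p)} \left[ \|\Pi_m \mathbf{Y}\|_n^2 - \sigma^2 \bar{\chi}_k^{-1}\left(\alpha / \tbinom{p}{k}\right) \right] ,
\]

where $\mathcal{M}(k,p)$ is defined in Section 2. We also consider

\[
T^*_{\alpha,n} := \|\mathbf{Y}\|_n^2 - \sigma^2 \bar{\chi}_n^{-1}(\alpha) .
\]

The procedure $T^*_\alpha$ is defined by

\[
T^*_\alpha := \left[ \vee_{1 \leq k \leq k^*} T^*_{\alpha/(2k^*),k} \right] \vee T^*_{\alpha/2,n} . \tag{10.1}
\]

The hypothesis $H_0$ is rejected if $T^*_\alpha$ is positive.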

$T^*_{\alpha,k}$ corresponds to a Bonferroni multiple testing procedure based on a large number of parametric
tests of the hypothesis $H_0$: $\{\theta_0 = 0_p\}$ against $H_{1,m}$: $\{\theta_0 \neq 0_p \text{ and } \mathrm{supp}(\theta_0) \subset m\}$ for any
$m \in \mathcal{M}(k,p)$. As a consequence, $T^*_{\alpha,k}$ allows us to test the hypothesis $H_0$: $\{\theta_0 = 0_p\}$ against $H_{1,k}$:
$\{\theta_0 \in \Theta[k,p] \setminus \{0_p\}\}$. Then, $T^*_\alpha$ corresponds to a Bonferroni multiple testing procedure based on
the statistics $T^*_{\alpha,k}$, $k \in \{1, \ldots, k^*\} \cup \{n\}$. Obviously, the procedure $T^*_\alpha$ is computationally intensive.
It is used here as a theoretical tool to derive minimax upper bounds.
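
To give a feeling for the computational cost, the following Python sketch computes $k^*$ from Definition 10.1, the per-subset level $\alpha/\binom{p}{k}$ used inside the supremum over subsets of size $k$, and the number of projections $\Pi_m$ that would have to be evaluated. The function names are hypothetical and only serve this illustration. The sketch does not evaluate the test itself.

```python
import math


def smallest_k_star(n: int, p: int) -> int:
    """Smallest integer k in {1, ..., p} with k * (1 + log(p / k)) >= sqrt(n)."""
    threshold = math.sqrt(n)
    for k in range(1, p + 1):
        if k * (1.0 + math.log(p / k)) >= threshold:
            return k
    raise ValueError("no k in {1, ..., p} reaches the threshold")


def subset_level(alpha: float, p: int, k: int) -> float:
    """Level alpha / C(p, k) allotted to each subset m of size k."""
    return alpha / math.comb(p, k)


def number_of_projections(p: int, k_star: int) -> int:
    """Number of subsets m of size 1, ..., k_star, i.e. of projections to compute."""
    return sum(math.comb(p, k) for k in range(1, k_star + 1))


if __name__ == "__main__":
    n, p, alpha = 100, 1000, 0.05
    k_star = smallest_k_star(n, p)
    print("k* =", k_star)
    print("level per subset of size k*:", subset_level(alpha, p, k_star))
    print("projections to compute:", number_of_projections(p, k_star))
```

With $n = 100$ and $p = 1000$, the sketch gives $k^* = 2$ and already requires $500500$ projections.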

10.1.2. Unknown variance: test $T_\alpha$

We introduce a second testing procedure to handle the case of unknown variance $\sigma^2$.

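Definition 10.2. [Procedure $T_\alpha$] Fixing some subset $m$ of $\{1, \ldots, p\}$ such that $n - |m| > 0$, we
note $d_m(\mathbf{X})$ the rank of the subdesign $\mathbf{X}_m$ of $\mathbf{X}$ of size $n \times |m|$. We define the Fisher statistic $\phi_m$
by

\[
\phi_m(\mathbf{Y}, \mathbf{X}) := \frac{\|\Pi_m \mathbf{Y}\|_n^2 \, (n - d_m(\mathbf{X}))}{\|\mathbf{Y} - \Pi_m \mathbf{Y}\|_n^2 \, d_m(\mathbf{X})} . \tag{10.2}
\]

We build the statistic $T_{\alpha,k}(\mathbf{Y}, \mathbf{X})$ as

\[
T_{\alpha,k} := \sup_{m \in \mathcal{M}(k,p)} \left\{ \phi_m(\mathbf{Y}, \mathbf{X}) - \bar{F}^{-1}_{d_m(\mathbf{X}),\, n - d_m(\mathbf{X})}\left[\alpha / \tbinom{p}{k}\right] \right\} , \tag{10.3}
\]

where $\bar{F}_{k,n-k}(u)$ denotes the probability for a Fisher variable with $k$ and $n - k$ degrees of freedom
to be larger than $u$. Finally, the statistic $T_\alpha$ is defined by

\[
T_\alpha := \sup_{k = 1, \ldots, \lfloor n/2 \rfloor} T_{\alpha/\lfloor n/2 \rfloor,\, k} . \tag{10.4}
\]

The hypothesis $H_0$ is rejected when $T_\alpha$ is positive.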

In fact, $T_\alpha$ is a Bonferroni multiple testing procedure. Contrary to $T^*_\alpha$, it is based on Fisher
statistics to handle the unknown variance. The ideas underlying this statistic have been introduced
in Baraud et al. [7] in the context of fixed design regression.
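
For a subset $m$ reduced to a single column, the Fisher statistic can be computed by hand. The Python sketch below assumes that $\phi_m$ is the usual F-ratio, namely the squared norm of the projection per degree of freedom divided by the squared norm of the residual per degree of freedom. The function name is hypothetical, and the case $|m| > 1$ (which needs a general projection) is deliberately left out.

```python
def fisher_statistic_single_column(y, x):
    """F-ratio for the one-dimensional model spanned by the column x (rank d_m = 1).

    Assumption of this sketch: phi_m = (||Pi_m y||^2 / d_m) / (||y - Pi_m y||^2 / (n - d_m)).
    """
    n = len(y)
    if len(x) != n or n < 2:
        raise ValueError("x and y must have the same length n >= 2")
    xx = sum(v * v for v in x)
    if xx == 0.0:
        raise ValueError("the column x must be non-zero")
    xy = sum(a * b for a, b in zip(x, y))
    projected = xy * xy / xx                      # ||Pi_m y||^2
    residual = sum(v * v for v in y) - projected  # ||y - Pi_m y||^2 by Pythagoras
    if residual <= 0.0:
        raise ValueError("y lies in the span of x: the statistic is undefined")
    d_m = 1
    return (projected / d_m) / (residual / (n - d_m))


if __name__ == "__main__":
    print(fisher_statistic_single_column([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```

The statistic is unchanged when $\mathbf{Y}$ is multiplied by a constant, which is why no knowledge of the noise variance is needed.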

10.2. Estimation procedures 

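10.2.1. Definition of the estimator $\widehat{\theta}^V$

Definition 10.3. [Estimator $\widehat{\theta}^V$] For any integer $k \in \{1, \ldots, p\}$, we consider a least-squares
estimator $\widehat{\theta}_k$ defined by

\[
\widehat{\theta}_k \in \arg\min_{\theta \in \Theta[k,p]} \|\mathbf{Y} - \mathbf{X}\theta\|_n^2 . \tag{10.5}
\]

Let us define the penalty function pen : $\{1, \ldots, \lfloor (n-1)/4 \rfloor\} \to \mathbb{R}^+$ by

\[
\mathrm{pen}(k) := K \frac{k}{n} \log\left(\frac{ep}{k}\right) , \tag{10.6}
\]

where $K > 0$ is a tuning parameter. The dimension $\widehat{k}^V$ is selected as follows

\[
\widehat{k}^V \in \arg\min_{1 \leq k \leq \lfloor (n-1)/4 \rfloor} \log\left[\|\mathbf{Y} - \mathbf{X}\widehat{\theta}_k\|_n^2\right] + \mathrm{pen}(k) .
\]

For short, we note $\widehat{\theta}^V = \widehat{\theta}_{\widehat{k}^V}$.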

This variable selection procedure relies on complexity penalization. The penalty pen($k$) depends
on $k$ and on the number of subsets of $\{1, \ldots, p\}$ of size $k$. Observe that the estimator
$\widehat{\theta}^V$ does not require the knowledge of $\sigma^2$.

The choice of the tuning parameter $K$ is universal: it neither depends on $n$, $p$, $k$, nor on $\Sigma$, $\theta_0$,
$\sigma^2$. It is only constrained to be larger than a positive numerical constant so that the equations
(B.8), (B.24), (B.26), (B.31), and (B.34) in the proofs of Theorem 5.2, Propositions 5.5 and 6.3
in [43] hold.
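
The selection step can be sketched in a few lines of Python once the residual sums of squares $\|\mathbf{Y} - \mathbf{X}\widehat{\theta}_k\|_n^2$ are available. Two assumptions are made for illustration only: the penalty is taken of the form $K (k/n) \log(ep/k)$, and the value of $K$ used in the example is arbitrary. The function names are hypothetical, and computing the least-squares estimators themselves is not addressed.

```python
import math


def log_criterion_penalty(k: int, n: int, p: int, K: float) -> float:
    """Penalty assumed in this sketch: K * (k / n) * log(e * p / k)."""
    return K * k / n * math.log(math.e * p / k)


def select_dimension(rss: dict, n: int, p: int, K: float) -> int:
    """Return the k minimizing log(rss[k]) + pen(k) over 1 <= k <= floor((n - 1) / 4).

    rss maps each candidate dimension k to the residual sum of squares of the
    least-squares estimator over k-sparse vectors (assumed to be precomputed).
    """
    k_max = (n - 1) // 4
    candidates = [k for k in rss if 1 <= k <= k_max]
    if not candidates:
        raise ValueError("no admissible dimension in rss")
    return min(candidates, key=lambda k: math.log(rss[k]) + log_criterion_penalty(k, n, p, K))


if __name__ == "__main__":
    rss = {1: 40.0, 2: 20.0, 3: 19.0, 4: 18.5}  # made-up values for illustration
    print(select_dimension(rss, n=50, p=200, K=2.0))
    # Rescaling the data rescales every rss[k] by the same factor and only
    # shifts the criterion by a constant, so the selected dimension is unchanged.
    print(select_dimension({k: 100.0 * v for k, v in rss.items()}, n=50, p=200, K=2.0))
```

Because the residual sum of squares enters through its logarithm, rescaling the data does not change the selected dimension, which matches the fact that the procedure does not require the noise variance.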

10.2.2. Definition of the estimator $\widehat{\theta}^{BM}$ and proof of (5.6) in Proposition 5.3

Definition 10.4. [Procedure for fixed design regression] Define $k^*$ as the smallest integer $k$
such that $k[1 + \log(p/k)] \geq n$. Let us consider the collection of dimensions $\mathcal{K} := \{1, \ldots, k^*\} \cup \{n\}$.
Then, the penalty function pen : $\mathcal{K} \to \mathbb{R}^+$ is defined by

^•):4 4 ^ 4+ 9 l0g(f)] t 

2n if k = n , 

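We recall that for $k \leq k^*$, the estimators $\widehat{\theta}_k$ are defined in (10.5) and that
$\widehat{\theta}_n \in \arg\min_{\theta \in \mathbb{R}^p} \|\mathbf{Y} - \mathbf{X}\theta\|_n^2$. The size $\widehat{k}^{BM}$ is selected by minimizing the following penalized criterion

\[
\widehat{k}^{BM} := \arg\inf_{k \in \{1, \ldots, k^*\} \cup \{n\}} \|\mathbf{Y} - \mathbf{X}\widehat{\theta}_k\|_n^2 + \sigma^2 \mathrm{pen}(k) . \tag{10.7}
\]

For short, we write $\widehat{\theta}^{BM} = \widehat{\theta}_{\widehat{k}^{BM}}$.

Observe that the estimator $\widehat{\theta}^{BM}$ requires the knowledge of the variance $\sigma^2$. Then, Eq. (5.6) is a
special case of Theorem 1 in Birgé and Massart [12].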
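
The known-variance criterion (10.7) can be sketched along the same lines. In the Python sketch below, the penalty is passed as a callable so that no particular form is assumed, and the linear penalty in the example is purely hypothetical. The function names are illustrative, and the residual sums of squares are assumed to be precomputed.

```python
import math


def k_star_fixed_design(n: int, p: int) -> int:
    """Smallest integer k in {1, ..., p} with k * (1 + log(p / k)) >= n."""
    for k in range(1, p + 1):
        if k * (1.0 + math.log(p / k)) >= n:
            return k
    raise ValueError("no k in {1, ..., p} reaches the threshold n")


def select_dimension_known_variance(rss: dict, sigma2: float, pen, n: int, p: int) -> int:
    """Minimize rss[k] + sigma2 * pen(k) over the collection {1, ..., k*} U {n}."""
    k_star = k_star_fixed_design(n, p)
    collection = sorted(set(range(1, k_star + 1)) | {n})
    missing = [k for k in collection if k not in rss]
    if missing:
        raise ValueError(f"rss is missing the dimensions {missing}")
    return min(collection, key=lambda k: rss[k] + sigma2 * pen(k))


if __name__ == "__main__":
    n, p = 5, 10
    rss = {1: 10.0, 2: 3.0, 5: 0.0}  # made-up values; rss[n] = 0 mimics a perfect fit

    def hypothetical_penalty(k: int) -> float:
        return 2.0 * k  # placeholder, not the penalty of Definition 10.4

    print(select_dimension_known_variance(rss, 1.0, hypothetical_penalty, n, p))
```

Unlike the criterion based on the logarithm of the residual sum of squares, this one is not invariant under rescaling of the data, which is why $\sigma^2$ has to be supplied.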

11. Deviation inequalities 

The proofs of the deviation inequalities stated in this section are postponed to Appendix C in [43] . 
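Lemma 11.1 ($\chi^2$ distributions). For any integer $d > 0$ and any number $0 < x < 1$,

\[
\mathbb{P}\left(\chi^2(d) \geq d + 2\sqrt{d\log(1/x)} + 2\log(1/x)\right) \leq x ,
\]
\[
\mathbb{P}\left(\chi^2(d) \leq d - 2\sqrt{d\log(1/x)}\right) \leq x .
\]

For any positive number $0 < x < 1$,

\[
\mathbb{P}\left[\chi^2(d) \leq dCx^{2/d}\right] \leq x , \tag{11.1}
\]

where the constant $C = \exp(-1)$.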
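
The three inequalities of Lemma 11.1 can be checked numerically when $d$ is even, since the tail of a $\chi^2(d)$ variable then has a closed form: $\mathbb{P}(\chi^2(d) > u) = e^{-u/2} \sum_{j < d/2} (u/2)^j / j!$. The Python sketch below relies on this identity. The function names are hypothetical, and the check is an illustration on a few values, not a proof.

```python
import math


def chi2_sf_even(u: float, d: int) -> float:
    """P(chi2(d) > u) for an even number d of degrees of freedom (closed form)."""
    if d <= 0 or d % 2 != 0:
        raise ValueError("d must be a positive even integer")
    if u <= 0.0:
        return 1.0
    half = u / 2.0
    term, total = 1.0, 1.0
    for j in range(1, d // 2):
        term *= half / j  # term equals (u / 2)^j / j!
        total += term
    return math.exp(-half) * total


def check_lemma_11_1(d: int, x: float):
    """Return three booleans, one per inequality of Lemma 11.1."""
    log_inv = math.log(1.0 / x)
    upper = d + 2.0 * math.sqrt(d * log_inv) + 2.0 * log_inv
    lower = d - 2.0 * math.sqrt(d * log_inv)
    small = d * math.exp(-1.0) * x ** (2.0 / d)
    return (
        chi2_sf_even(upper, d) <= x,        # upper deviation
        1.0 - chi2_sf_even(lower, d) <= x,  # lower deviation
        1.0 - chi2_sf_even(small, d) <= x,  # small-ball bound (11.1)
    )


if __name__ == "__main__":
    for d in (2, 4, 10, 50):
        for x in (0.5, 0.1, 0.01):
            print(d, x, check_lemma_11_1(d, x))
```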
Lemma 11.2 (Wishart distributions). Let $Z^T Z$ be a standard Wishart matrix of parameters $(n, d)$
with $n > d$. For any number $0 < x < 1$,



P ¥W {Z T Z) >n(l+ ^d/n~ + ^2\og(l/x)/nj 
P tp min (Z T Z) <n(l- ^djn~ - v/2Iog(l/s)/n) 

For any $(n, d)$ with $n \geq 4d + 1$ and any number $0 < x < 1$,

Iog(2/x 



a; . 



1 V 



< a; , 



where C is a numerical constant. 



(11.2) 



(11.3) 



The first two deviation inequalities are taken from Theorem 2.13 in [19]. The bound (11.3) allows us
to control the tail distribution of the smallest eigenvalue of a Wishart distribution. Rudelson and
Vershynin [41] have provided a control similar to (11.3) under subgaussian assumptions. However,
their results only hold for events of probability smaller than $1 - e^{-n}$.